JP2010191851A - Article feature word extraction device, article feature word extraction method and program

Info

Publication number: JP2010191851A
Application number: JP2009037684A
Authority: JP
Inventors: Megumi Makino (牧野 恵); Takeshi Masuyama (増山 毅司)
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2009-02-20
Filing date: 2009-02-20
Publication date: 2010-09-02
Anticipated expiration: 2029-02-20
Also published as: JP5085584B2

Abstract

PROBLEM TO BE SOLVED: To provide an article feature word extraction device for extracting a feature word for a title corresponding to an input article on the basis of named entities, and to provide an article feature word extraction method and program.

SOLUTION: The article feature word extraction device 1 for extracting a feature word for a title corresponding to an input article includes: an article input means 11 for receiving an input article with a category attached thereto; a similar article extracting means 12 for extracting a similar article having the same category as the input article and similar to the input article from a past article storing part 22 for storing the past articles classified into categories and titles in association with each other; a named entity extracting means 13 for extracting a named entity from the input article and a similar article title by using a named entity storing part 24 for storing the named entity; a named entity generalizing means 14 for generalizing the named entity and applying the generalized named entity to the input article and the similar article title; a feature word extracting means 15 for extracting a feature word from the input article and the similar article title after the application of the generalized named entity; and an article output means 16 for outputting an input article in which the feature word can be identified.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an article feature word extraction device, an article feature word extraction method, and a program for extracting feature words for a title corresponding to an input article.

Conventionally, users obtain a wide variety of information from the Internet. A large amount of information, such as news articles, is accumulated on the Internet. Articles are generally classified into categories and are accompanied by a relatively short text, typically a title, that conveys the content of the article. Because articles carry categories and titles, users can find the articles they want efficiently. However, writing titles for the news articles that are published every day is laborious. Devices that assist this work have therefore been disclosed (for example, Patent Document 1).

Patent Document 1: JP 2000-29882 A

The device described in Patent Document 1, for example, performs linguistic analysis on a target document and on documents in the same field and, based on the result, determines important words, example words, and the like to create a summary. The important words identified by such a process are those that appear frequently among documents in the same field. In an article, however, named entities such as place names and person names, which typically characterize the content, appear infrequently, so named entities are unlikely to be selected as important words.

An object of the present invention is to provide an article feature word extraction device, an article feature word extraction method, and a program that extract feature words for a title corresponding to an input article on the basis of named entities.

The present inventors found that generalizing named entities makes it possible to extract named entities as feature words that match the titles of similar articles, and thereby completed the present invention. Specifically, the present invention provides the following.

(1) An article feature word extraction device that extracts feature words for a title corresponding to an input article, comprising:
article input means for receiving an input article to which a category is attached;
similar article extraction means for extracting, from a past article storage unit that stores past articles classified into the categories in association with their titles, a similar article whose category matches that of the input article received by the article input means and which is similar to the input article;
named entity extraction means for extracting named entities from each of the input article and the title of the similar article, using a named entity storage unit that stores named entities;
named entity generalization means for generalizing the named entities extracted by the named entity extraction means and applying the generalized named entities to the input article and the title of the similar article; and
feature word extraction means for extracting feature words from the input article and the title of the similar article after the generalized named entities have been applied by the named entity generalization means.

According to this configuration, the named entities contained in the input article and in the title of a similar article, which is a past article in the same category as the input article, are generalized before feature words are extracted, so that named entities such as place names and person names that characterize the content of the article can be extracted. Here, generalization means, for example, tagging or replacement with a keyword. Generalizing named entities that differ from article to article lets them be treated as identical at the level of a superordinate concept, which allows similar items to be extracted more precisely. In the remainder of this specification, "title" includes subtitles and summaries.

(2) The article feature word extraction device according to (1), further comprising provisional title output means for creating and outputting a provisional title for the input article by replacing the feature words of the title of the similar article, extracted by the feature word extraction means, with the corresponding feature words of the input article.

According to this configuration, a provisional title for the input article is created by replacing the feature words in the title of the similar article with the corresponding feature words of the input article, and the created provisional title is output. Creation of the title of the input article can thus be semi-automated, which improves the efficiency of title creation by the user.

(3) The article feature word extraction device according to (1) or (2), wherein the feature word extraction means extracts, as the feature words, words, compound words, and phrases that appear in both the input article and the title of the similar article.

According to this configuration, words, compound words, and phrases that appear in both the input article and the title of the similar article are extracted, so that words, compound words, and phrases common to the articles can be obtained.

(4) The article feature word extraction device according to any one of (1) to (3), wherein
the named entity storage unit stores named entity patterns extracted using machine learning, and
the named entity extraction means extracts the named entities from each of the input article and the title of the similar article on the basis of the named entity patterns.

According to this configuration, named entity patterns are stored in advance, and using them makes it easy to determine whether an expression is a named entity and to extract it.

(5) The article feature word extraction device according to any one of (1) to (4), wherein the similar article extraction means extracts a plurality of the similar articles, the number of articles to be extracted being variable.

According to this configuration, the number of articles extracted as similar articles can be changed to any number. Increasing or decreasing the number of similar articles used for comparison therefore changes which feature words are extracted.

(6) The article feature word extraction device according to any one of (1) to (5), further comprising article output means for outputting the input article with the feature words extracted by the feature word extraction means made identifiable.

According to this configuration, the input article is output with its feature words made identifiable. For example, underlining the feature words in the input article or setting them in bold presents them to the user in an easily understood form.
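The identifiable output described above can be sketched as follows. This is only an illustration: the function name `mark_feature_words`, the default `<u>` markers, and the sample sentence are assumptions and do not come from the specification, which only mentions underlining or bold type as examples.

```python
import re


def mark_feature_words(text, feature_words, left="<u>", right="</u>"):
    """Wrap every occurrence of a feature word in `text` with markers.

    The markers default to HTML underline tags; this is an assumed output
    format chosen for the sketch, not one named in the specification.
    """
    # Ignore empty strings, which would otherwise match at every position.
    words = {w for w in feature_words if w}
    if not words:
        return text
    # Longest first, so that a compound word wins over a word it contains.
    alternatives = sorted(words, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in alternatives))
    return pattern.sub(lambda m: left + m.group(0) + right, text)


marked = mark_feature_words("The sea ice area shrank", ["ice", "sea ice area"])
print(marked)
```

A single regular-expression pass is used instead of repeated string replacement so that a marker inserted for one feature word is never rewritten while handling another.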

(7) The article feature word extraction device according to (6), wherein
the feature word extraction means extracts, as the feature words, the generalized named entities that appear in both the input article and the title of the similar article, and
the article output means outputs the input article after restoring each generalized named entity extracted as a feature word to its specific named entity.

According to this configuration, generalized named entities that appear in both the input article and the title of the similar article are extracted, and the named entities are output in an identifiable form. Comparing the input article with the title of the similar article thus allows words, compound words, and phrases that are commonly used in titles to be extracted as feature words. In addition, each extracted generalized named entity is restored to its original named entity before being output in an identifiable form, so the named entities are presented in an easily understood way.

(8) The article feature word extraction device according to any one of (1) to (7), further comprising:
title input means for receiving input of a title for the input article created by a user; and
article accumulation means for storing the title and the input article in association with each other in the past article storage unit.

According to this configuration, the past articles can include articles whose titles were created using the output of this device. Because each article is accumulated in association with the title given by the user, an input article that has been output can later serve as a similar article from which feature words are extracted for other input articles.

(9) An article feature word extraction method for extracting, by a computer, feature words for a title corresponding to an input article, the method comprising:
a past article storage step of storing past articles classified into categories in association with their titles;
a named entity storage step of storing named entities;
an article input step of receiving an input article to which one of the categories is attached;
a similar article extraction step of extracting, from the past articles stored in the past article storage step, a similar article whose category matches that of the input article received in the article input step and which is similar to the input article;
a named entity extraction step of extracting, from each of the input article and the title of the similar article, the named entities stored in the named entity storage step;
a named entity generalization step of generalizing the named entities extracted in the named entity extraction step and applying the generalized named entities to the input article and the title of the similar article; and
a feature word extraction step of extracting feature words from the input article and the title of the similar article after the generalized named entities have been applied in the named entity generalization step.

(10) A program for causing a computer to execute the steps of the article feature word extraction method according to (9).

According to the present invention, feature words for a title can be extracted efficiently on the basis of similar articles, which are past articles in the same category as the input article. In particular, because generalizing named entities allows them to be extracted as feature words that match the titles of similar articles, named entities that are important in a title can be extracted as feature words.

A diagram showing the functional configuration of the article feature word extraction device according to the present embodiment.
A diagram showing the hardware configuration of the article feature word extraction device according to the present embodiment.
A diagram showing an example of processing by the article feature word extraction device according to the present embodiment.
A flowchart of the main process according to the present embodiment.
A flowchart of the similar article extraction process according to the present embodiment.
A diagram showing the method of extracting and generalizing named entities according to the present embodiment.
A diagram showing a specific example according to the present embodiment.

Hereinafter, an embodiment for carrying out the present invention will be described with reference to the drawings. This is merely an example, and the technical scope of the present invention is not limited to it.

(Embodiment)
[Functional configuration of article feature word extraction device 1]
FIG. 1 is a diagram showing the functional configuration of the article feature word extraction device 1 according to the present embodiment. The article feature word extraction device 1 extracts feature words for a title of an input article and outputs the feature words in an identifiable form. Each input article is associated with one of a set of predefined categories. "For a title" means that the feature words can be used in creating text that concisely expresses the content of the input article, such as a title, a subtitle, or a summary. A feature word is a word, compound word, or phrase that expresses a characteristic of the input article.

The article feature word extraction device 1 includes an input unit 3, an output unit 5, a control unit 10, and a storage unit 20. The input unit 3 is, for example, a keyboard or a mouse, and is used to enter the input article to be processed by the article feature word extraction device 1. The input unit 3 may instead be a communication unit that receives articles from a terminal (not shown), a document server, or the like connected via a communication line (not shown). The output unit 5 is, for example, a display, and outputs the input article after the feature words for the title have been extracted. The output unit 5 may instead be a communication unit that transmits the processed input article to a terminal or the like connected via a communication line.

The control unit 10 includes article input means 11, similar article extraction means 12, named entity extraction means 13, named entity generalization means 14, feature word extraction means 15, article output means 16, provisional title output means 17, title input means 18, and article accumulation means 19. The storage unit 20 includes a past article storage unit 22 that stores past articles and a named entity storage unit 24 that stores named entity patterns. Named entities are, for example, proper nouns such as place names, person names, and organization names, as well as dates, time expressions, monetary expressions, and quantity expressions. Each function is described in detail later.

The number of hardware units making up the article feature word extraction device 1 is not limited; it may be configured from one or more hardware units as needed. When it is configured from a plurality of hardware units, they may be connected to one another via a communication line, for example. The functions of the present embodiment may also be realized by providing a separate server for each of the functions described above and having the servers cooperate by exchanging signals.

[Hardware configuration of article feature word extraction device 1]
FIG. 2 is a diagram showing the hardware configuration of the article feature word extraction device 1 according to the present embodiment. The processing device on which the present invention is implemented may be a standard one; an example configuration is shown below.

The article feature word extraction device 1 includes a CPU (Central Processing Unit) 1010 constituting the control unit 10 (in a multiprocessor configuration, additional CPUs such as a CPU 1012 may be added), a bus line 1005, a communication I/F (I/F: interface) 1040, a main memory 1050, a BIOS (Basic Input Output System) 1060, a display device 1022, an I/O controller 1070, an input device 1100 such as a keyboard and mouse, a hard disk 1074, an optical disk drive 1076, and a semiconductor memory 1078. The hard disk 1074, the optical disk drive 1076, and the semiconductor memory 1078 are collectively referred to as the storage unit 20.

The control unit 10 controls the article feature word extraction device 1 as a whole. By reading and executing the various programs stored on the hard disk 1074 as appropriate, it cooperates with the hardware described above to realize the various functions of the present invention.

The communication I/F 1040 is a network adapter used when the article feature word extraction device 1 exchanges information with other devices via a communication line. The communication I/F 1040 may include a modem, a cable modem, and an Ethernet (registered trademark) adapter.

The BIOS 1060 stores a boot program executed by the CPU 1010 when the article feature word extraction device 1 starts up, programs that depend on the hardware of the article feature word extraction device 1, and the like.

The display device 1022 includes a display such as a cathode ray tube (CRT) display or a liquid crystal display (LCD).

The storage unit 20, that is, storage devices such as the hard disk 1074, the optical disk drive 1076, and the semiconductor memory 1078, can be connected to the I/O controller 1070.

The input device 1100 accepts input from the administrator of the article feature word extraction device 1.

The hard disk 1074 stores the various programs that cause this hardware to function as the article feature word extraction device 1, the programs that execute the functions of the present invention, and the past article storage unit 22 and named entity storage unit 24 described above. The article feature word extraction device 1 can also use a separately provided external hard disk (not shown) as an external storage device.

As the optical disk drive 1076, for example, a DVD-ROM drive, a CD-ROM drive, a DVD-RAM drive, or a BD (Blu-ray (registered trademark) Disc) drive can be used. In that case, an optical disk 1077 compatible with the drive is used. Programs or data can also be read from the optical disk 1077 by the optical disk drive 1076 and supplied to the main memory 1050 or the hard disk 1074 via the I/O controller 1070.

In the present invention, a computer means an information processing device that includes a storage device, a control unit, and the like. The article feature word extraction device 1 is an information processing device that includes the storage unit 20, the control unit 10, and the like, and this information processing device falls within the concept of a computer as used in the present invention.

[Specific Example 1]
FIG. 3 is a diagram showing an example of processing by the article feature word extraction device 1 according to the present embodiment. FIG. 3(1) shows the body 30 of the input article. The category of the body 30, "global warming", has been specified in advance by the user. A category is a topic used to classify articles, such as politics, society, economy, or sports, and input articles are classified by category. Categories may be arranged hierarchically, for example into major, middle, and minor items; in that case, the category specified for an input article is preferably one at the finest level, such as a minor item.

Next, the control unit 10 extracts from the past article storage unit 22 the title and body of a similar article that resembles the body 30. The past article storage unit 22 is held in the storage unit 20. As shown in FIG. 3(2), the past article storage unit 22 has the items category 22a, title 22b, and body 22c, and each record (row) represents one article.

FIG. 3(2) shows a similar article extracted from the past article storage unit 22. Similar articles are extracted from among the records whose category 22a in the past article storage unit 22 is the same as the category of the input article. In this example, an article in the same "global warming" category as the input article has been extracted. A similar article is an article that is compared with the input article in order to extract feature words. The extracted similar article consists of a title 31 and a body 32. The title 31 is taken from the title 22b, and the body 32 from the body 22c, of the past article storage unit 22.
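The record layout and the same-category restriction described above might be modeled as in the following sketch. The class name `PastArticle`, the function `same_category`, and the two sample records are hypothetical and were made up for illustration; only the three items (category 22a, title 22b, body 22c) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PastArticle:
    """One record (row) of the past article store."""
    category: str  # corresponds to item 22a
    title: str     # corresponds to item 22b
    body: str      # corresponds to item 22c


def same_category(store, category):
    """Return the stored articles whose category equals `category`."""
    return [article for article in store if article.category == category]


# Made-up sample records, for illustration only.
store = [
    PastArticle("global warming",
                "Arctic sea ice area shrinks, Japan announces",
                "Body text of the first past article."),
    PastArticle("sports",
                "Home team wins the final",
                "Body text of the second past article."),
]

candidates = same_category(store, "global warming")
print([article.title for article in candidates])
```

Only the candidates returned here would go on to the similarity comparison with the input article.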

Next, the control unit 10 extracts named entities from each of the body 30 of the input article and the title 31 of the similar article, and generalizes them. FIG. 3(3) shows the named entities contained in the body 30 and the title 31, together with the generalized named entity corresponding to each. The named entities extracted from the body 30 of the input article are the place name "France" and the date "the 23rd", and they are stored in a named entity part 33. A generalization part 34 stores the generalized named entities obtained by generalizing the named entities stored in the named entity part 33. Generalization of the named entities is performed on the basis of the named entity storage unit 24. Similarly, the named entities extracted from the title 31 of the similar article are stored in a named entity part 35, and a generalization part 36 stores the generalized named entities obtained by generalizing them. The named entity parts 33 and 35 and the generalization parts 34 and 36 may be held temporarily in the storage unit 20.
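As a minimal sketch of generalization by tag replacement, the following code assumes a small hand-written dictionary `NE_DICT` in place of the named entity storage unit 24. The dictionary, the tag spellings `<PLACE>` and `<DATE>`, the function name `generalize`, and the two sample sentences are all assumptions; only "France" and "the 23rd" are taken from the example above.

```python
# Hypothetical stand-in for the named entity store: surface form -> type tag.
NE_DICT = {
    "France": "<PLACE>",
    "Japan": "<PLACE>",
    "the 23rd": "<DATE>",
    "the 5th": "<DATE>",
}


def generalize(text, ne_dict=NE_DICT):
    """Replace each known named entity in `text` with its type tag.

    Returns the generalized text and a list of (tag, surface form) pairs
    for the entities that were found, so the originals can be restored.
    """
    found = []
    # Longer surface forms first, so a short entry cannot split a longer one.
    for surface in sorted(ne_dict, key=len, reverse=True):
        if surface in text:
            tag = ne_dict[surface]
            found.append((tag, surface))
            text = text.replace(surface, tag)
    return text, found


input_text, input_found = generalize("Sea ice shrinks, France announced on the 23rd")
title_text, title_found = generalize("Sea ice shrinks, Japan announced on the 5th")
print(input_text)
print(title_text)
```

After generalization the two sentences become identical, which is what lets entities that differ between articles match each other. The returned pairs keep the original surface forms for later restoration.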

Next, feature words are extracted with the named entities in their generalized form. The feature words include the generalized named entities in the body 30 of the input article that match generalized named entities contained in the title 31 of the similar article. In this example, the feature words that are generalized named entities are "place name" and "date". The feature words also include words, compound words, and phrases in the title 31 of the similar article, other than generalized named entities, that also appear in the body 30. In this example, these are "Arctic Ocean", "sea ice area", "decrease", and "announcement". Here, "sea ice area" is a compound word containing the word "ice".
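The matching step can be sketched as an ordered intersection of two token lists. The sketch assumes that both texts have already been split into words, compound words, and phrases and that named entities have already been replaced by type tags. The token lists, the tag spellings, and the function name `extract_feature_words` are illustrative assumptions, loosely modeled on the feature words named in this example.

```python
def extract_feature_words(body_tokens, title_tokens):
    """Return the tokens of the similar title that also occur in the body.

    Order follows the title, and each token is reported once. Generalized
    named entity tags are treated like any other token, so they match
    whenever both texts contain an entity of the same type.
    """
    in_body = set(body_tokens)
    seen = set()
    features = []
    for token in title_tokens:
        if token in in_body and token not in seen:
            seen.add(token)
            features.append(token)
    return features


# Assumed, already-generalized token lists (not taken from the figures).
body_tokens = ["<PLACE>", "institute", "<DATE>", "announcement",
               "Arctic Ocean", "sea ice area", "decrease", "record"]
title_tokens = ["Arctic Ocean", "sea ice area", "sharp", "decrease",
                "<PLACE>", "announcement", "<DATE>"]

features = extract_feature_words(body_tokens, title_tokens)
print(features)
```

The token "sharp" appears only in the title, so it is not reported as a feature word.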

FIG. 3(4) shows the body 37 of the input article with the feature words made identifiable, and a provisional title 38 for the input article. The feature words extracted using the generalized named entities may be output as a feature word list, but, as illustrated, they may also be output in an identifiable form within the body 37 of the input article. In the body 37, the feature words may be underlined as in this example, or they may be colored or emphasized; in any case they are preferably output so that the user can identify them at a glance. The provisional title 38 in the figure is obtained by replacing the parts of the title 31 of the similar article in FIG. 3(2) that were extracted as feature words with the corresponding feature words extracted from the input article. If more than one replacement is possible, a plurality of candidate titles may be displayed as the provisional title 38. The replacement can be performed by associating the input article with the title of the similar article, in particular by pairing named entities of the same type. The title 31 of the similar article and the provisional title 38 may be displayed side by side to the user. It is also desirable that the user be able to edit the provisional title 38 created in this way as necessary.
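The replacement by named entity type can be sketched as below. The function name `make_provisional_title`, the type labels, and the similar-article title are made up for illustration; the actual title 31 appears only in the figure and is not reproduced here. "France" and "the 23rd" are the input article's entities from the example.

```python
def make_provisional_title(similar_title, title_entities, input_entities):
    """Swap each named entity in the similar title for the input article's
    entity of the same type; entities with no counterpart are left as-is.

    title_entities: list of (type, surface form) pairs found in the title.
    input_entities: dict mapping type -> surface form from the input article.
    """
    title = similar_title
    for ne_type, surface in title_entities:
        replacement = input_entities.get(ne_type)
        if replacement is not None:
            title = title.replace(surface, replacement)
    return title


provisional = make_provisional_title(
    "Arctic Ocean sea ice area decreases, Japan announces on the 5th",
    [("PLACE", "Japan"), ("DATE", "the 5th")],
    {"PLACE": "France", "DATE": "the 23rd"},
)
print(provisional)
```

This sketch keeps a single entity per type. When the input article contains several entities of one type, each combination could be emitted as a separate candidate, in line with the multiple title candidates mentioned above.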

[Flowchart]
Next, the flow of the processing described with reference to FIG. 3 is explained. FIG. 4 is a flowchart of the main process according to the present embodiment. FIG. 5 is a flowchart of the similar article extraction process according to the present embodiment. FIG. 6 is a diagram showing the method of extracting and generalizing named entities according to the present embodiment. The following description refers to FIGS. 4 to 6.

In FIG. 4, S1: The control unit 10 (article input means 11) receives an input article with a category attached from the input unit 3. The input article entered from the input unit 3 of the article feature word extraction device 1 may be stored temporarily in the storage unit 20.

S2: The control unit 10 (similar article extraction means 12) performs a similar article extraction process that extracts articles similar to the input article (similar articles) from the past article storage unit 22.

The similar article extraction process is described here with reference to FIG. 5.

In FIG. 5, S11: The similar article extraction means 12 extracts, from the past article storage unit 22, articles in the same category as the input article.

S12: The similar article extraction means 12 performs morphological analysis on the input article and on the bodies and titles of the extracted articles of the same category.

S13: For each word found by morphological analysis in the body of each article, the similar article extraction means 12 assigns a weight using TF-IDF (Term Frequency, Inverse Document Frequency), an index based on word occurrence frequency. As a result, a word that is used frequently within an article but appears rarely in other articles receives a large weight.

S14: The similar article extraction means 12 extracts, as feature words, the most heavily weighted words contained in the input article. More than one feature word may be extracted.

S15: The similar article extraction means 12 performs an OR search with the extracted feature words over the extracted articles of the same category, and extracts n articles in descending order of score. The number of articles to extract (n) may be determined in advance or may be entered by the user from the input unit 3. Alternatively, the control unit 10 may determine the number at random.

By making the number n of top-ranked articles used as similar articles variable in this way, the feature words that are extracted can be varied.
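
Steps S13 to S15 can be sketched in plain Python as follows. This is a sketch under stated assumptions: the articles are assumed to be already tokenized and already filtered to the same category (S11, S12), the TF-IDF variant (relative term frequency times log(N/df)) is one common choice, and the OR-search score (number of matched query words) is an assumption, since the text does not define how the score is computed. All function names and sample data are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, corpus):
    """TF-IDF weight of each word in doc_tokens; corpus must contain the doc."""
    counts = Counter(doc_tokens)
    weights = {}
    for word, count in counts.items():
        df = sum(1 for doc in corpus if word in doc)
        weights[word] = (count / len(doc_tokens)) * math.log(len(corpus) / df)
    return weights


def top_feature_words(input_tokens, past_docs, k):
    """S13, S14: the k most heavily weighted words of the input article."""
    weights = tfidf_weights(input_tokens, past_docs + [input_tokens])
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def similar_articles(input_tokens, past_docs, k, n):
    """S15: OR search with the feature words; return indices of the top n hits."""
    query = set(top_feature_words(input_tokens, past_docs, k))
    scored = []
    for index, doc in enumerate(past_docs):
        score = len(query & set(doc))  # assumed score: matched query words
        if score > 0:                  # OR search: one match is enough
            scored.append((score, index))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [index for _, index in scored[:n]]


# Hypothetical same-category past articles and input article.
past = [
    ["ice", "arctic", "decrease", "report"],
    ["ice", "hockey", "match", "report"],
    ["election", "vote", "report"],
]
article = ["arctic", "ice", "ice", "decrease", "report"]

words = top_feature_words(article, past, 2)
hits = similar_articles(article, past, 3, 2)
```

Changing `n` changes which titles take part in the later comparison, which is how the extracted feature words come to vary.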

Returning to FIG. 4, S3: The control unit 10 (named entity extraction means 13) extracts named entities from the input article and from the titles of the extracted similar articles.

S4: Based on the learning data in the named entity storage unit 24, the control unit 10 (named entity generalization means 14) generalizes each named entity contained in the input article and in the titles of the similar articles. Here, a title means a headline whose content is limited to a predetermined maximum number of characters; the maximum is, for example, 13 characters.

The extraction and generalization of named entities are described here using the example in FIG. 6.

FIG. 6(1) is an example of the learning source data 40 stored in the named entity storage unit 24. The learning source data 40 is obtained by performing morphological analysis on a document, attaching a surface form 41, a part of speech 42, and a conjugation 43 to each morpheme, and attaching the correct tag as the tag 44. In the tag 44, "ORG" is a generalized named entity representing an organization and "PER" is one representing a person's name. "I-" in the tag 44 is a marker meaning that the morpheme continues from a "B-" tag. "O" in the tag 44 is a marker for "other", that is, not a named entity.
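
To illustrate how tags of this kind lead to generalization, the sketch below collapses each run of B-/I- tagged morphemes into a single placeholder for its type. It is an illustration only: the function name `generalize`, the angle-bracket placeholder format, and the romanized sample tokens are assumptions and are not taken from FIG. 6.

```python
def generalize(tagged):
    """Replace each named entity span by a placeholder such as "<PER>".

    tagged: list of (surface, tag) pairs with tags "O", "B-XXX", or "I-XXX".
    """
    result = []
    inside = None  # type of the entity currently being continued, if any
    for surface, tag in tagged:
        if tag == "O":
            result.append(surface)
            inside = None
        elif tag.startswith("B-") or inside != tag[2:]:
            # A new entity starts (an "I-" without a matching "B-" is
            # treated as a start as well).
            inside = tag[2:]
            result.append("<" + inside + ">")
        # Otherwise the morpheme continues the current entity: emit nothing.
    return result


# Hypothetical tagged morphemes (romanized for readability).
tagged = [("Yahoo", "B-ORG"), ("Japan", "I-ORG"), ("no", "O"),
          ("Yamada", "B-PER"), ("Naohiro", "I-PER"), ("shi", "O")]

generalized = generalize(tagged)
```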

The named entity storage unit 24 also holds a feature 50 used to identify named entities. The feature 50 is one example, and a plurality of features may be prepared. The learning data 51 is the result of applying the feature 50 to "直弘" (Naohiro) shown in the row 45 of the learning source data 40. The feature 50 is applied to every morpheme, that is, to every row of the learning source data 40. The results are then stored as learning data, in the form of patterns, in the named entity storage unit 24. When an unknown document is input, these patterns can be used to judge which words are named entities.

FIG. 6(2) shows the result data 62 obtained by applying the feature 50 to "哲" (Tetsu) in the row 61 when the target data 60 of a document is input. The result data 62 shares with the learning data 51 the words "氏" (Mr.) and "は" (the particle wa) within the two-word window before and after the target, so it can be inferred that "哲" should receive the person-name tag "PER". Accordingly, "哲" in the row 61 is given the tag "PER" and is thereby treated as a named entity.
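
The inference from shared context words can be sketched as follows. The text does not specify the learning algorithm, so this sketch substitutes a very simple technique, namely choosing the tag of the stored pattern whose two-word context overlaps most with the target, purely to illustrate the idea. The function names and the romanized sample sentences are hypothetical.

```python
def window_features(tokens, i, size=2):
    """Context feature set: (offset, surface) for neighbours within the window."""
    features = set()
    for offset in range(-size, size + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            features.add((offset, tokens[j]))
    return features


def learn_patterns(tagged):
    """Store one (context features, tag) pattern per morpheme of the tagged data."""
    surfaces = [surface for surface, _ in tagged]
    return [(window_features(surfaces, i), tag)
            for i, (_, tag) in enumerate(tagged)]


def predict_tag(tokens, i, patterns):
    """Tag of the pattern sharing the most context features; "O" if none match."""
    target = window_features(tokens, i)
    best_tag, best_overlap = "O", 0
    for features, tag in patterns:
        overlap = len(target & features)
        if overlap > best_overlap:
            best_tag, best_overlap = tag, overlap
    return best_tag


# Hypothetical learning source data and an unseen sentence.
training = [("Yamada", "B-PER"), ("Naohiro", "I-PER"), ("shi", "O"),
            ("wa", "O"), ("katatta", "O")]
unseen = ["Suzuki", "Tetsu", "shi", "wa", "nobeta"]

patterns = learn_patterns(training)
tag = predict_tag(unseen, 1, patterns)   # tag for "Tetsu"
entity_type = tag.split("-")[-1]         # generalized type without B-/I-
```

In this toy data "Tetsu" is followed by "shi" and "wa", exactly as "Naohiro" is in the training sentence, so it inherits the person-name tag.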

Returning to FIG. 4, S5: The control unit 10 (feature word extraction means 15) extracts the feature words contained in both the input article and the titles of the similar articles. Specifically, the control unit 10 extracts the generalized named entities contained in both the input article and the titles of the similar articles, and also extracts the words, compound words, and phrases that match between the input article and those titles.

S6: The control unit 10 (article output means 16) outputs to the output unit 5 the input article with the feature words made identifiable. In doing so, the control unit 10 restores (specializes) each generalized named entity to the named entity it was before generalization and underlines the feature words. The user can thus pick out the feature words in the output article, which is useful, for example, for skimming the article.
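
The output step can be sketched as follows, assuming that each token of the input article keeps both its original surface form and, for named entities, its generalized placeholder. The function name `mark_feature_words` and the `__` markers, which stand in for underlining, are hypothetical.

```python
def mark_feature_words(tokens, feature_words, mark="__", joiner=" "):
    """Render the article with feature words marked.

    tokens: list of (surface, generalized) pairs; generalized is a placeholder
    such as "<LOC>" for named entities and None for ordinary words.
    feature_words: set of feature words, in generalized form where applicable.
    """
    rendered = []
    for surface, generalized in tokens:
        key = generalized if generalized is not None else surface
        if key in feature_words:
            # Specialize: show the original surface, not the placeholder.
            rendered.append(mark + surface + mark)
        else:
            rendered.append(surface)
    return joiner.join(rendered)


# Hypothetical input article tokens and extracted feature words.
tokens = [("Tokyo", "<LOC>"), ("reported", None), ("snow", None),
          ("on", None), ("Jan 5", "<DATE>")]
feature_words = {"<LOC>", "<DATE>", "snow"}

marked = mark_feature_words(tokens, feature_words)
```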

S7: The control unit 10 (provisional title output means 17) replaces the feature words in the title of a similar article with the feature words of the input article, creates a provisional title for the input article, and outputs it to the output unit 5. In doing so, the control unit 10 may restore (specialize) each generalized named entity to the named entity it was before generalization and underline the feature words. Because the control unit 10 semi-automates the creation of a title for the input article in this way, the efficiency of title creation by the user can be improved.

S8: The control unit 10 (title input means 18) receives, from the input unit 3, the title of the input article created by the user.

S9: The control unit 10 (article storage means 19) stores the input article and its title in association with each other in the past article storage unit 22.

In this way, the article feature word extraction device 1 can efficiently extract feature words for a title based on the titles of similar articles, which are past articles in the same category as the input article. In particular, by generalizing named entities, the article feature word extraction device 1 extracts named entities as feature words that match the titles of similar articles, so that named entities, which are important in a title, can be extracted as feature words.

[Specific Example 2]
Next, a specific example based on an actual article is shown. FIG. 7 is a diagram showing a specific example according to the present embodiment.

FIG. 7(1) shows the body 70 of the input article. The category of the input article is "entertainment". When the input article is first given, the body 70 has no underlining or other marking.

FIG. 7(2) shows three articles similar to the input article, together with their titles, extracted from among the articles stored in the past article storage unit 22 whose category is "entertainment". Similar articles can be extracted, for example, by the method shown in FIG. 5. Here there are three pairs: the body 71 and the title 72, the body 73 and the title 74, and the body 75 and the title 76. The bodies 71, 73, and 75 are all similar to the body 70.

FIG. 7(3) shows the titles with their named entities generalized. The titles 72, 74, and 76 all contain the name of a celebrity, which is generalized with the tag <PER> representing a person's name. FIG. 7(4) shows the body 70 of the input article with its named entities generalized. The celebrity's name contained in the body 70 is generalized with the tag <PER>. These can be judged to represent person names from the patterns stored as learning data, as shown in FIG. 6.

FIG. 7(5) shows the feature words shared by the body 70 and the titles 72, 74, and 76, extracted and reflected in the body 70. Extracting feature words by comparing the texts with their named entities generalized makes it possible to pick up named entities that are suitable for a title.

The body 70 is changed so that the user can easily identify the feature words and is displayed on the output unit 5. The user can therefore pick feature words out of the body 70 and easily assign a title that concisely conveys the content of the article, for example "イジ○○岡田氏、ろっ骨を骨折" (Mr. Okada of Iji○○ fractures a rib).

Embodiments of the present invention have been described above, but the present invention is not limited to those embodiments. The effects described in the embodiments merely list the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.

(Modifications)
In the present embodiment, the input article is output with the feature words underlined. However, as long as the user can identify them, the feature words may instead be output in a display mode that differs from that of the other words, compound words, and phrases, for example in a different color or font. Outputting the feature words in a distinct display mode lets the user recognize them, which is useful, for example, for skimming the article. In addition, because varying the number n of top-ranked articles used as similar articles changes the feature words that are extracted, selecting an appropriate number n makes it possible to display several sets of feature words for skimming.

In the present embodiment, similar articles are extracted by performing morphological analysis on the article bodies and weighting the words with TF-IDF, but any other method may be used as long as it outputs articles similar to the input article.

The present embodiment also shows an example in which, together with the input article with its feature words emphasized, a provisional title is created and output by replacing the feature words in the title of a similar article with the feature words of the input article; however, the invention is not limited to this. For example, the input article and the title of the similar article may both be output, and the feature words made identifiable in the input article may be connected by lines to the feature words made identifiable in the title of the similar article, so that the corresponding positions are shown.

[Description of Symbols]
1 Article feature word extraction device
10 Control unit
11 Article input means
12 Similar article extraction means
13 Named entity extraction means
14 Named entity generalization means
15 Feature word extraction means
16 Article output means
17 Provisional title output means
18 Title input means
19 Article storage means
20 Storage unit
22 Past article storage unit
24 Named entity storage unit

Claims

An article feature word extraction device that extracts feature words for a title corresponding to an input article, comprising:
article input means for receiving an input article with a category attached;
similar article extraction means for extracting, from a past article storage unit that stores past articles classified into categories in association with their titles, a similar article that has the same category as the input article received by the article input means and is similar to the input article;
named entity extraction means for extracting named entities from the input article and the title of the similar article, using a named entity storage unit that stores named entities;
named entity generalization means for generalizing the named entities extracted by the named entity extraction means and applying the generalized named entities to the input article and the title of the similar article; and
feature word extraction means for extracting feature words from the input article and the title of the similar article after the generalized named entities have been applied by the named entity generalization means.

The article feature word extraction device according to claim 1, further comprising provisional title output means for creating and outputting a provisional title for the input article by replacing the feature words of the title of the similar article extracted by the feature word extraction means with the corresponding feature words of the input article.

The article feature word extraction device according to claim 1 or 2, wherein the feature word extraction means extracts, as the feature words, words, compound words, and phrases that match between the input article and the title of the similar article.

The article feature word extraction device according to any one of claims 1 to 3, wherein the named entity storage unit stores named entity patterns extracted using machine learning, and the named entity extraction means extracts the named entities from the input article and the title of the similar article based on the named entity patterns.

The article feature word extraction device according to any one of claims 1 to 4, wherein the similar article extraction means extracts a plurality of the similar articles while varying the number of articles to be extracted.

The article feature word extraction device according to any one of claims 1 to 5, further comprising article output means for outputting the input article with the feature words extracted by the feature word extraction means made identifiable.

The article feature word extraction device according to claim 6, wherein the feature word extraction means extracts, as the feature words, the generalized named entities that match between the input article and the title of the similar article, and the article output means outputs the input article with the generalized named entities extracted as the feature words specialized back into the named entities.

The article feature word extraction device according to any one of claims 1 to 7, further comprising:
title input means for receiving input of a title for the input article created by the user; and
article storage means for storing the title and the input article in association with each other in the past article storage unit.

An article feature word extraction method in which a computer extracts feature words for a title corresponding to an input article, the method including:
a past article storing step of storing past articles classified into categories and their titles in association with each other;
a named entity storing step of storing named entities;
an article input step of receiving an input article with a category attached;
a similar article extraction step of extracting, from the past articles stored in the past article storing step, a similar article that has the same category as the input article received in the article input step and is similar to the input article;
a named entity extraction step of extracting, from the input article and the title of the similar article, the named entities stored in the named entity storing step;
a named entity generalization step of generalizing the named entities extracted in the named entity extraction step and applying the generalized named entities to the input article and the title of the similar article; and
a feature word extraction step of extracting feature words from the input article and the title of the similar article after the generalized named entities have been applied in the named entity generalization step.

A program for causing a computer to execute the steps of the article feature word extraction method according to claim 9.