JP4466334B2

JP4466334B2 - Information classification method and apparatus, program, and storage medium storing program

Info

Publication number: JP4466334B2
Application number: JP2004324241A
Authority: JP
Inventors: 佳代池田; 伸治安部; 雅且大久保
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-11-08
Filing date: 2004-11-08
Publication date: 2010-05-26
Anticipated expiration: 2024-11-08
Also published as: JP2006134183A

Description

The present invention relates to an information classification method and apparatus, a program, and a storage medium storing the program. In particular, it relates to an information classification method and apparatus, a program, and a storage medium storing the program for extracting the topic words of the moment from information obtained, using a search keyword of interest to the user, out of the large volume of Web information newly published every day, and for classifying that information.

In recent years, information has come to be updated from moment to moment and delivered to users. The update rate and growth of information on the Internet are especially remarkable. Under these circumstances, it is difficult for users to obtain the information they want in step with the topics that currently interest the public.

However, information that is updated daily is likely to contain many topics of interest to users, such as matters of public concern, new events, the course of incidents, and trends. Analyzing this information therefore makes it possible to extract the topics in which users are interested.

Furthermore, when trying to obtain desired content from a large amount of information, searching alone often fails to yield the desired information. For this reason, many techniques for automatically classifying the obtained information have been proposed.

For topic extraction, several techniques that extract topics from multiple pieces of document information have been proposed. For example, there is a technique that extracts words representing the subject of a conversation from data in which the utterances of multiple speakers have been transcribed. This technique extracts nouns from the transcribed conversation by morphological analysis and determines the weight of each word (the likelihood that it represents the topic) based on its frequency and interval of appearance in the flow of the conversation. Words used frequently within a single utterance, and words used again after not having been used for a while, are evaluated as highly important (hereinafter, the first conventional technique) (see, for example, Patent Document 1).

There is also a technique that calculates the momentum of information from messages exchanged in rapid succession, like a back-and-forth conversation, and extracts phrases with strong momentum as topic words (hereinafter, the second conventional technique) (see, for example, Non-Patent Document 1).

For information classification, several techniques have been proposed that learn in advance what kind of information falls under each predetermined classification category and, when unknown information is input, analyze which category it falls under (hereinafter, the third conventional technique) (see, for example, Non-Patent Document 2).

There is also a technique that calculates a phrase vector from the frequency of appearance of words in a document and clusters documents according to the similarity between them, as well as a technique that names clusters according to the similarity of the phrase vectors (hereinafter, the fourth conventional technique) (see, for example, Patent Document 2).

In addition, some search engines automatically classify the results of a keyword search. Such techniques may use an ontology for classification, or the information may already carry classification categories.

Patent Document 1: Japanese Patent No. 2931553
Patent Document 2: Japanese Patent No. 3385297
Non-Patent Document 1: Megumi Ishii et al., "Proposal of a Topic Extraction Method Using the Momentum of Noun Phrases and Words," IPSJ SIG Technical Report, vol. 2004, no. 23, 2004-NL-160, pp. 79-84
Non-Patent Document 2: Naonori Ueda et al., "Probabilistic Models for Multi-Topic Text: Parametric Mixture Models," IEICE Transactions (D-II), Vol. J87-DII, No. 3, March 2004, pp. 872-883

In the first conventional technique, the frequency of a word within a single utterance often has no connection with the topic being discussed overall. In addition, a relatively common word that is used with particular intensity and frequency can also be said to represent a topic, but the technique is not suited to extracting such words either.

The second conventional technique calculates the momentum of information from back-and-forth message exchanges, and is therefore not suited to extracting topic words from a large number of documents written from entirely different viewpoints.

The third conventional technique requires the classification categories to be determined in advance, and is therefore not suited to information whose topics change one after another.

The fourth conventional technique does not cluster from the viewpoint of topic words that many people are writing about; instead, it first calculates a phrase vector within each document. Topic words should be classified according to whether a phrase is taken up in many documents, not according to how often it appears within a single document. Such a method is therefore not suited to extracting topic words. In particular, Web pages such as blogs, news, and diaries often cover a variety of topics on a single page, which makes it difficult to extract topics by clustering based on the frequency of phrases within a document.

Conventional techniques that automatically classify search engine results require something like an ontology or a dictionary in advance, or require the categories to be defined beforehand. They are therefore not suited to extracting new trends and topics from information that is updated from moment to moment.

A topic word in the present invention is a phrase that is taken up in many documents, and topic words vary over time (for example, phrases that appear in many documents in a concentrated burst over a short period, and phrases that are taken up in many documents over a long period). Among these, an attractive topic word is preferably one that has strong impact and whose content can be pictured immediately.

The present invention has been made in view of the above points. Its object is to provide an information classification method and apparatus, a program, and a storage medium storing the program that, without prior learning and in real time, analyze document data acquired one after another, or a large volume of document data, based on a certain keyword, extract phrases that have become topics from the viewpoints of relevance to the keyword and temporal freshness, and classify the document data by those topic words, thereby enabling a classification that is more characteristic of the keyword.

FIG. 1 is a diagram explaining the principle of the present invention.

The present invention is an information classification method for extracting topic words from document data acquired based on a certain keyword and classifying the document data by the topic words, thereby performing a classification that is more characteristic of the keyword, the method comprising:
a data collection step (step 1) of, using a specified keyword as a search keyword, acquiring from the search results the update date, the search result output rank, and the body text (or a passage containing part of the body text) that make up the document data, and, when the body text or part of it cannot be acquired, supplementarily collecting the body text based on the publication location of the document data obtained from the search results, and storing the result in a document database;
a topic word candidate extraction step (step 2) of reading the document data from the document database, extracting topic word candidates from the document data by referring to topic word rules, which use combinations of parts of speech and are stored in topic word rule storage means, and storing the candidates in a topic word database;
a topic word aggregation step (step 3) of aggregating the topic word candidates read from the topic word database based on topic word aggregation rules stored in topic word aggregation rule storage means;
a topic word score calculation step (step 5) of calculating, for each topic word candidate in the topic word database, a topic word score from the degree of relevance between the search keyword and the document data containing the topic word and from the update time of the document data; and
a document classification step (step 6) of selecting topic words from the relationship between the topic word score of each topic word candidate in the topic word database and the document data containing the candidate, and classifying the document data by topic word,
wherein, in the topic word score calculation step (step 5),
the topic word score of a topic word candidate is the sum of the document topic word scores obtained from the document data containing the character string of that candidate, and
each document topic word score is determined from the search result output rank of the one piece of document data containing the character string of the candidate and from the update date of that document data obtained from the search results.

In step 1 above, the document data needed for topic word extraction (update date, search result output rank, body text or part of the body text such as a summary or the passages before and after the keyword, and so on) is acquired, based on the keyword, from the search results of the information source the user wants. If the body text or part of it cannot be acquired, the publication location of the document is also acquired from the search results as part of the document data, and the data is supplementarily collected based on that location. When the entire body text or a long text is obtained, the passages before and after the search keyword within the document are extracted and used as the body text. When the search keyword appears more than once in the document, all the passages around those occurrences are combined and treated as the body text.
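The keyword-context handling in step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `extract_keyword_context` and the `window` parameter (number of characters kept on each side of the keyword) are assumptions, since the text does not say how much surrounding text is kept.

```python
def extract_keyword_context(text: str, keyword: str, window: int = 40) -> str:
    """Collect the passages around every occurrence of `keyword` in `text`.

    `window` is a hypothetical parameter: the number of characters kept
    before and after each occurrence. Overlapping passages are merged,
    and all passages are joined into a single body text.
    """
    if not keyword:
        return ""
    spans = []  # list of (start, end) character ranges to keep
    pos = text.find(keyword)
    while pos != -1:
        lo = max(0, pos - window)
        hi = min(len(text), pos + len(keyword) + window)
        if spans and lo <= spans[-1][1]:
            # Overlaps the previous passage: extend it instead of duplicating.
            spans[-1] = (spans[-1][0], max(spans[-1][1], hi))
        else:
            spans.append((lo, hi))
        pos = text.find(keyword, pos + 1)
    return " ".join(text[lo:hi] for lo, hi in spans)
```

When the keyword does not occur at all, the function returns an empty string, which a caller could treat as the signal to fall back to supplementary collection from the publication location.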

In the topic word candidate extraction step of step 2, topic word candidates are extracted from the document data acquired in step 1 (after morphological analysis), based on the topic word rules (after notational variants have been normalized). The topic word rules are stored in topic word rule storage means or the like, and hold certain conditions (such as combinations of parts of speech) so that phrases suitable as topic words can be extracted. The normalization of notational variants is described in detail in the embodiments.

In the topic word aggregation step of step 3, topic word candidates that can be taken to have the same meaning are aggregated based on certain conditions, namely the topic word aggregation rules stored in topic word aggregation rule storage means or the like. The topic word aggregation rules are described in detail in the embodiments.

In the time elapse inspection step of step 4, the temporal freshness of the topic word candidates aggregated in step 3 is inspected. Phrases that have continued to be extracted as topic words throughout a certain period T (a positive integer), and phrases registered in the NG word list, are removed from the topic word candidates (and any phrase newly removed from the candidates that is not yet in the NG word list is added to the list). Step 4 is used mainly when the search is executed repeatedly at regular intervals under the same conditions, and may be skipped for a one-off execution.
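One way to read the step 4 freshness check is sketched below. It assumes that the period T is counted in consecutive runs of the periodic search; the text only says that T is a positive integer, so this unit, the `history` dictionary, and the function name `filter_stale_candidates` are all hypothetical.

```python
def filter_stale_candidates(candidates, history, ng_words, period_t):
    """Drop candidates that are stale or already on the NG word list.

    candidates: phrases extracted in the current run
    history:    dict mapping phrase -> number of consecutive runs in which
                it has been extracted (updated in place; an assumption)
    ng_words:   set of excluded phrases (updated in place)
    period_t:   the period T, assumed here to be a number of runs
    """
    current = set(candidates)
    kept = []
    for phrase in candidates:
        if phrase in ng_words:
            continue  # already registered as unsuitable
        streak = history.get(phrase, 0) + 1
        history[phrase] = streak
        if streak >= period_t:
            # Extracted continuously for the whole period: no longer fresh.
            ng_words.add(phrase)
            continue
        kept.append(phrase)
    # A phrase that was not extracted this time loses its streak.
    for phrase in list(history):
        if phrase not in current:
            del history[phrase]
    return kept
```

Because `history` and `ng_words` are mutated in place, the same two objects would be passed to every periodic run; for a one-off execution the whole function can simply be skipped, as the text allows.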

In the topic word score calculation step of step 5, topic word scores are calculated, and the document data acquired in step 1 is output for each topic word. The topic word score is calculated using the update date and the search output rank of the document data associated with each topic word candidate. A topic word extracted from document data with a newer update date and a higher search output rank receives a higher document topic word score for that document data. The topic word score of a topic word is then determined by summing its document topic word scores.
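The scoring in step 5 can be illustrated with a short sketch. The text here states only that a newer update date and a higher search rank raise the document topic word score, and that these per-document scores are summed; the specific formula below (reciprocal of rank times reciprocal of age in days plus one) and the dictionary keys `rank`, `updated`, and `text` are assumptions chosen for illustration.

```python
from datetime import date


def document_topic_score(rank: int, updated: date, today: date) -> float:
    """Hypothetical per-document score: higher for better rank and newer date."""
    age_days = max((today - updated).days, 0)
    return (1.0 / rank) * (1.0 / (1.0 + age_days))


def topic_word_scores(documents, candidates, today):
    """Sum the document topic word scores of every document containing the word.

    documents:  list of dicts with assumed keys "rank" (1 = top result),
                "updated" (datetime.date) and "text" (body text)
    candidates: iterable of topic word candidate strings
    """
    scores = {}
    for word in candidates:
        scores[word] = sum(
            document_topic_score(doc["rank"], doc["updated"], today)
            for doc in documents
            if word in doc["text"]  # document contains the candidate string
        )
    return scores
```

Any monotone function of rank and update date would fit the description equally well; the summation over documents is the part taken directly from the text.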

In the document classification step of step 6, the topic words are classified together with the document data and output in descending order of topic word score.

FIG. 2 is a diagram showing the principle configuration of the present invention.

The present invention (claim 1) is an information classification apparatus for extracting topic words from document data acquired based on a certain keyword and classifying the document data by the topic words, thereby performing a classification that is more characteristic of the keyword, the apparatus comprising:
topic word rule storage means 620 storing topic word rules 22, which use combinations of parts of speech as conditions for extracting topic words;
topic word aggregation rule storage means 630 storing topic word aggregation rules 23, which consist of combinations of parts of speech and are used to aggregate topic words that can be taken to have the same meaning;
data collection means 300 for, using a specified keyword as a search keyword, acquiring from the search results the update date, the search result output rank, and the body text (or a passage containing part of the body text) that make up the document data, and, when the body text or part of it cannot be acquired, supplementarily collecting the body text based on the publication location of the document data obtained from the search results, and storing the result in a document database 20;
topic word candidate extraction means 310 for reading the document data from the document database 20, extracting topic word candidates from the document data by referring to the topic word rules 22 stored in the topic word rule storage means 620, and storing the candidates in a topic word database 21;
topic word aggregation means 320 for aggregating the topic word candidates read from the topic word database 21 based on the topic word aggregation rules 23 stored in the topic word aggregation rule storage means 630;
topic word score calculation means 340 for calculating, for each topic word candidate in the topic word database 21, a topic word score from the degree of relevance between the search keyword and the document data containing the topic word and from the update time of the document data; and
document classification means 350 for selecting topic words by narrowing the topic word candidates in the topic word database 21 down to those whose topic word score is at least a predetermined value, and further to those for which the number of documents containing the character string of the candidate is at least a predetermined number, and for classifying the document data by topic word in descending order of topic word score,
wherein the topic word score calculation means 340 includes:
means for obtaining the topic word score of a topic word candidate by summing the document topic word scores obtained from the document data containing the character string of that candidate; and
means for determining each document topic word score from the search result output rank of the one piece of document data containing the character string of the candidate and from the update date of that document data obtained from the search results.

The present invention (claim 2) is the information classification apparatus of claim 1, wherein the topic word aggregation means 320 includes means for extracting, as topic words:
phrases that use a combination of parts of speech satisfying the rules predefined by the topic word rules;
phrases that are not contained in the search keyword;
phrases that are not in the NG word list 24, which stores phrases unsuitable as topic words; and
phrases aggregated and selected as a single topic word from phrases that can be taken to have the same meaning based on the topic word aggregation rules.

The present invention (claim 3) is the information classification apparatus of claim 1, further comprising time elapse inspection means 330 for, when information classification by topic words is executed repeatedly, excluding from the topic words a phrase that, once extracted as a topic word, has continued to be extracted throughout a certain period T (a positive integer).

The data collection means 300 collects the target document data from a search engine based on a keyword that is input externally or set in advance, and stores it in the document database 20. When the entire body text cannot be obtained, or when a long text is obtained, the passages before and after the search keyword within the document are extracted and used as the body text. When the search keyword appears more than once in the text, all the passages around those occurrences are combined and treated as the body text.

The topic word candidate extraction means 310 reads from the document database 20 the document information acquired by the data collection means 300 and extracts topic word candidates from it based on the topic word rules 22. The topic word rules 22 are stored in the topic word rule storage unit 620 and hold certain conditions (such as combinations of parts of speech) so that phrases suitable as topic words can be extracted. The normalization of notational variants is described in detail in the embodiments. The extracted topic word candidates are stored in the topic word database 21, together with the related document information.
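Rule-based candidate extraction of this kind can be sketched as a pattern match over part-of-speech tags. The sketch assumes that morphological analysis has already produced `(surface, part_of_speech)` pairs; the tag names and the example rules are hypothetical, because this passage does not list the actual contents of the topic word rules 22.

```python
def extract_candidates(tokens, rules):
    """Return every token sequence whose part-of-speech tags match a rule.

    tokens: list of (surface, part_of_speech) pairs from a morphological
            analyzer (the analyzer itself is outside this sketch)
    rules:  list of tuples of part-of-speech tags, e.g. ("noun", "noun");
            these stand in for the stored topic word rules
    """
    candidates = []
    for start in range(len(tokens)):
        for rule in rules:
            window = tokens[start:start + len(rule)]
            if len(window) != len(rule):
                continue  # rule would run past the end of the sentence
            if all(pos == wanted for (_, pos), wanted in zip(window, rule)):
                candidates.append("".join(surface for surface, _ in window))
    return candidates
```

For example, with the rules `[("noun",), ("noun", "particle", "noun")]`, the tokens for 「プログラミングの言語を学ぶ」 would yield 「プログラミング」, 「プログラミングの言語」, and 「言語」 as candidates.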

The topic word aggregation means 320 aggregates topic word candidates in the topic word database 21 that can be taken to have the same meaning, based on certain conditions, namely the topic word aggregation rules 23 stored in the topic word aggregation rule storage unit 630. Phrases that have continued to be extracted as topic words throughout a certain period T (a positive integer) are removed from the topic word candidates. The topic word aggregation rules 23 are expressed as combinations of parts of speech and the like. Details of the rules are given in the embodiments.

The time elapse inspection means 330 obtains from the topic word database 21 the topic word candidates extracted by the topic word aggregation means 320 and inspects their temporal freshness. It also removes from the topic word candidates any phrase registered in the NG word list 24 stored in the NG word list storage means 640. Any phrase newly removed from the candidates that is not yet in the NG word list 24 is added to the list. The remaining topic word candidates are stored in the topic word database 21 again, with the aggregated topic word candidate information associated with the related document information.

This means is used mainly when the search is executed repeatedly at regular intervals under the same conditions, and may be skipped for a one-off execution.

The topic word score calculation means 340 calculates the topic word score using the update date and the search output rank of the document data associated with each topic word candidate in the topic word database 21. A topic word candidate extracted from document data with a newer update date and a higher search output rank receives a higher document topic word score for that document data. The topic word score of a topic word is then determined by summing its document topic word scores.

The document classification means 350 outputs the topic words together with the document data in descending order of topic word score. The topic word scores can also be stored in the topic word database 21.
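The selection and ordering performed by the document classification means 350 can be sketched as below, combining the two thresholds named in claim 1 (minimum topic word score and minimum number of documents containing the candidate) with the descending-score output. The function name, the parameter names `min_score` and `min_docs`, and the `text` key are assumptions; the text gives no concrete threshold values.

```python
def classify_by_topic_word(scores, documents, min_score, min_docs):
    """Select topic words and group documents under them, best score first.

    scores:    dict mapping topic word candidate -> topic word score
    documents: list of dicts with an assumed "text" key
    min_score: hypothetical threshold on the topic word score
    min_docs:  hypothetical threshold on the number of matching documents
    Returns a list of (topic_word, score, matching_documents) tuples.
    """
    groups = []
    for word, score in scores.items():
        if score < min_score:
            continue  # score too low to count as a topic word
        matching = [doc for doc in documents if word in doc["text"]]
        if len(matching) < min_docs:
            continue  # taken up by too few documents
        groups.append((word, score, matching))
    # Output in descending order of topic word score.
    groups.sort(key=lambda group: group[1], reverse=True)
    return groups
```

A document that contains several selected topic words appears under each of them, which matches the idea of classifying the same search results by topic word and not into mutually exclusive categories.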

The present invention (claim 4) is an information classification program for causing a computer to function as each of the means constituting the information classification apparatus of any one of claims 1 to 3.

The present invention (claim 5) is a computer-readable recording medium storing the information classification program of claim 4.

This makes it possible to provide a method of classifying search results by topic words that, in real time and based on a certain keyword, analyzes document data acquired one after another or a large volume of document data, extracts phrases that have become topics from the viewpoints of relevance to the keyword and temporal freshness, and classifies the document data by those topic words, thereby enabling a classification that is more characteristic of the keyword.

As described above, according to the present invention, topic words are extracted mainly from the summary of the body text obtained in the search results and from the text surrounding the search keyword, so the topic words are more likely to be closely related to the keyword. Topic words that are more deeply related to the keyword can therefore be extracted.

Some retrieved documents cover all sorts of matters exhaustively, so an ordinary classification apparatus may judge them to be unclassifiable or to have no distinguishing features. In the present invention, the text near the search keyword, or a summary of the document related to the keyword, is acquired from the search results, which makes it possible to classify by the topics around the keyword, that is, the topics of the desired information.

It also becomes possible, without prior learning and in real time, to analyze document data that is updated and acquired one after another, or a large volume of document data, obtained based on a certain keyword, and to extract from it the phrases that have become topics.

Furthermore, because topic words are scored with their freshness taken into account, using the update date of the documents and the like, the classification differs from a mere classification of search results: it is based on the topics visible in the results, that is, on trends and user concerns. The classification is therefore easy for users to understand and reflects current trends.

As described above, by extracting topic words from the text surrounding the keyword, and by giving a higher topic word score to phrases drawn from newer documents and from documents ranked higher in the search results, the invention can classify by topic, which is easier for users to understand than the classifications of conventional techniques and changes freely with the trends of the moment.

Embodiments of the present invention are described below with reference to the drawings.

First, an overview of the present invention is given.

FIG. 3 is a diagram for explaining the overview of the present invention.

The present invention, in real time and based on a certain keyword, analyzes document data acquired one after another or a large volume of document data, extracts phrases that have become topics from the viewpoints of relevance to the keyword and temporal freshness, and classifies the document data by those topic words, thereby producing a classification that is more characteristic of the keyword.

In step 10, the document data needed for topic word extraction (update date, search result output rank, body text or part of the body text such as a summary or the passages before and after the keyword, and so on) is acquired, based on the keyword, from the search results of the information source the user wants. If the body text or part of it cannot be acquired, the publication location of the document, such as its URL (URI), is also acquired from the search results as part of the document data, and the data is supplementarily collected based on that location. When the entire body text or a long text is obtained, the passages before and after the search keyword within the document are extracted and used as the body text. When the search keyword appears more than once in the text, all the passages around those occurrences are combined and treated as the body text.

In step 20, morphological analysis is performed on the document data acquired in step 10, notational variants are normalized according to the topic word rules, and topic word candidates are then extracted using the topic word rules. The topic word rules are stored in storage means such as a topic word rule storage unit and hold fixed conditions (combinations of parts of speech and the like) so that phrases suitable as topic words can be extracted. Normalizing notational variants means that, because different writers spell the same word differently, candidates that can be regarded as the same word are merged into one. Examples are 「犬」 and 「イヌ」 (dog), 「沈殿」 and 「沈澱」 (precipitation), 「インタフェース」 and 「インターフェース」 (interface), 「大日本帝国」 and 「大日本帝國」 (Empire of Japan), and 「打合せ」, 「打ち合せ」 and 「打ち合わせ」 (meeting). The rules for notational variants are also stored in the topic word rules. A topic word candidate that is identical to the keyword, or contained in it, is unsuitable as a topic word and is removed from the candidates. For example, when the keyword is 「甲子園球場」 (Koshien Stadium) and the candidates 「甲子園球場」, 「甲子園」 and 「球場」 are obtained, all three are contained in the keyword as character strings and can therefore be removed.

In step 30, topic word candidates that can be taken to have the same meaning are merged according to fixed conditions, namely the topic word aggregation rules stored in storage means such as a topic word aggregation rule storage unit. The aggregation rules are expressed as combinations of parts of speech and the like and apply, for example, in the following cases. 「プログラミング言語」 and 「プログラミングの言語」 (both meaning "programming language"), or, when the keyword is 「犬」 (dog), 「犬のトリーミング」 (dog trimming) and 「トリーミング」 (trimming), can each be judged to be a pair with the same meaning and can therefore be merged into one of the two.

In step 40, the temporal freshness of the topic word candidates merged in step 30 is checked. Phrases that have kept being extracted as topic words throughout a fixed period T (a positive integer) are removed from the candidates. Phrases registered in the NG word list held in the storage means are also removed. If a phrase newly removed from the candidates is not yet in the NG word list, it is added to the list. Step 40 is not mandatory.

Step 40 is used mainly when the search is repeated periodically under the same conditions, and is described in the second embodiment of the present invention. For a one-off run, the step may be skipped (first embodiment).
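
The freshness check of step 40 could be sketched roughly as follows. This is only an illustration under stated assumptions: the period T is counted here in consecutive runs of the same search, and the names `filter_stale`, `streaks` and `ng_words` are hypothetical and do not come from the specification.

```python
def filter_stale(candidates, streaks, ng_words, T):
    """Drop candidates that are NG words or were extracted T runs in a row.

    candidates: topic word candidates of the current run
    streaks:    candidate -> number of consecutive earlier runs (assumed bookkeeping)
    ng_words:   set of NG words, updated in place
    T:          positive integer, assumed to be a number of consecutive runs
    """
    new_streaks = {}
    fresh = []
    for word in candidates:
        run = streaks.get(word, 0) + 1
        new_streaks[word] = run
        if word in ng_words:
            continue  # already registered as an NG word
        if run >= T:
            ng_words.add(word)  # newly excluded, so register it
            continue
        fresh.append(word)
    # Words missing from this run are absent from new_streaks, so their streak resets.
    return fresh, new_streaks
```

The caller keeps `streaks` and `ng_words` between runs; a word that disappears for one run starts again from a streak of 1.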

In step 50, topic word scores are calculated and the document data acquired in step 10 is output per topic word. The topic word score is calculated from the update dates and search output ranks of the document data attached to each topic word candidate. A topic word extracted from document data with a recent update date and a high search output rank receives a high document topic word score for that document. The score of one topic word is the sum of its document topic word scores. Topic words are output together with their document data in descending order of topic word score.

[First Embodiment]
The first embodiment of the present invention aims to extract topic words from documents, such as Web pages, acquired on the basis of a given keyword, and to cluster the documents by topic word. Because the goal is topic word extraction, documents with some distinguishing character are better suited. For example, using Weblogs, news, or newly added general Web pages as the source makes it possible to extract whatever topics are attracting attention at the time. Pages that explain a particular subject also work: from a multi-page document on the history of France, for example, the main topics of that history can be extracted. The processing is described in detail below.

FIG. 4 shows the configuration of the topic-word-based information classification device according to the first embodiment of the present invention.

The device shown in FIG. 4 comprises a computer 10, a document database (document DB) 20 connected to the computer 10 via a network 40, a topic word database (topic word DB) 21, a topic word rule storage unit 620 that stores the topic word rules 22, and a topic word aggregation rule storage unit 630 that stores the topic word aggregation rules 23.

The computer 10 consists of memory (RAM, ROM, magnetic disk and the like), a CPU, a display unit 11 (a display), and an instruction input unit 12 (mouse, keyboard and the like). It has a data collection processing unit 500, a topic word candidate extraction processing unit 510, a topic word aggregation processing unit 520, a topic word score calculation processing unit 540 and a document classification processing unit 550, each implemented by a software program executed by the CPU.

The document DB 20 stores, for each content item subject to topic word extraction, its URL (URI), title, text representing the body content (a summary or the like), and meta information such as the update date and time and the search result output rank.

The topic word DB 21 stores the topic word candidates extracted from the content, together with the number of content items related to each candidate and information on those items; the topic words chosen by the topic word selection process, together with the number of related content items and information on those items; and the candidates that were merged into another candidate during the topic word selection process.

The topic word rules 22 are stored in the topic word rule storage unit 620 and describe the conditions (combinations of parts of speech and the like) for extracting topic words. The rules can be freely added to and changed.

The topic word aggregation rules 23 are stored in the topic word aggregation rule storage unit 630 and describe the conditions (combinations of parts of speech and the like) for merging topic words that can be taken to have the same meaning. These rules can also be freely added to and changed.

The data collection processing unit 500, the topic word candidate extraction processing unit 510, the topic word aggregation processing unit 520 and the topic word score calculation processing unit 540 form the core of the topic word extraction system configured in this way, and realize the present invention by executing the processing described below.

That processing is described below.

(1) Data collection processing unit 500:
FIG. 5 is a flowchart of the data collection process in the first embodiment of the present invention.

Step 501) First, the data collection processing unit 500 acquires, from external input or preset values, parameters such as the target from which topic words are to be extracted and the related keyword for which topic words are wanted. The parameters can take various forms as needed. For example:
(a) Extraction target: an existing DB, Web pages on the Internet in general, Weblogs, news articles, and so on;
(b) Related keyword: a keyword related to the information the user wants, a topic of particular interest, and so on.
Because (a) is searched using the keyword in (b), it must be searchable data (a DB, or data on which a search engine can be used). When (a) is not a DB, the search engine may be a publicly available search site, or a search server built and configured in advance.

Step 502) Next, the related keyword is sent as a search query to the extraction target acquired above. When the target is a search engine, some sites return search results simply when a URL (URI) with the search keyword embedded in it is sent to the search engine. For example, for the gooblog search, sending an address such as
「http://blog.goo.ne.jp/search/serch.php?status=select&tg=all&ts=goo&st=time&dc=10&dp=all&ts=all&MT=検索キーワード&da=all」
returns the search results (search result output rank, page title, update date, the sentences before and after the search keyword, and so on) for the search keyword given in place of 「検索キーワード」. The search keyword must, however, be URL-encoded (escaped). For example, 「サッカー」 (soccer) becomes 「%A5%B5%A5%C3%A5%AB%A1%BC」.
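
The escaping in step 502 can be illustrated with a short sketch. The escaped form quoted above corresponds to the EUC-JP bytes of the keyword, so the sketch assumes EUC-JP as the character set; the function name `encode_keyword` is hypothetical, and a different search site may expect a different encoding.

```python
from urllib.parse import quote


def encode_keyword(keyword: str, charset: str = "euc_jp") -> str:
    """Percent-encode a search keyword for embedding in a query URL.

    charset is an assumption: EUC-JP reproduces the example in the text.
    """
    # safe="" makes every reserved character get escaped as well.
    return quote(keyword, safe="", encoding=charset)


print(encode_keyword("サッカー"))  # %A5%B5%A5%C3%A5%AB%A1%BC
```

The resulting string would be substituted for the keyword placeholder in the query URL.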

When data is acquired from an existing DB, the entire body text may be available. Some search engines return a summary of the content as the search result. In either case the text is regarded as body text describing the content and is handled as such below.

When the body text, or a long text, is available, the sentences before and after the search keyword within the document are acquired and used as the body text. For example, 256 characters before and after the keyword may be taken with the keyword at the center, or the sentence containing the keyword may be taken together with one sentence before and one after it. When the search keyword appears several times in the text, all the sentences near those occurrences are treated together as the body text.
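
One way to take the text around each keyword occurrence is sketched below. It follows the 256-character variant and assumes that overlapping windows are merged and that separate windows are simply concatenated; the name `keyword_contexts` and these merging details are illustrative assumptions.

```python
def keyword_contexts(text: str, keyword: str, width: int = 256) -> str:
    """Return the text within `width` characters of every keyword occurrence."""
    if not keyword:
        return ""
    spans = []  # list of (start, end) windows, in document order
    start = text.find(keyword)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(text), start + len(keyword) + width)
        if spans and lo <= spans[-1][1]:
            # Window overlaps the previous one: extend it instead of duplicating text.
            spans[-1] = (spans[-1][0], max(spans[-1][1], hi))
        else:
            spans.append((lo, hi))
        start = text.find(keyword, start + len(keyword))
    return "".join(text[lo:hi] for lo, hi in spans)
```

A sentence-based variant would split the text on sentence delimiters first and keep the sentences adjacent to each hit.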

Step 503) The content information acquired in the above steps is stored in the document DB 20.

(2) Topic word candidate extraction processing unit 510:
FIG. 6 is a flowchart of the topic word candidate extraction process in the first embodiment of the present invention.

Step 601) First, the topic word candidate extraction processing unit 510 retrieves the target document information from the document DB 20.

Step 602) Next, the document is morphologically analyzed, after which the notational variants of each word are normalized. The morphological analysis is applied to the body text and title of the documents acquired by the data collection processing unit 500. Normalizing notational variants means that, because different writers spell the same word differently, candidates that can be regarded as the same word are merged into one. Examples are 「犬」 and 「イヌ」 (dog), 「沈殿」 and 「沈澱」 (precipitation), 「インタフェース」 and 「インターフェース」 (interface), 「大日本帝国」 and 「大日本帝國」 (Empire of Japan), and 「打合せ」, 「打合わせ」 and 「打ち合わせ」 (meeting).
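
The normalization of notational variants can be pictured as a lookup table from each variant to one representative spelling. In the minimal sketch below, the table `VARIANTS`, the function `normalize`, and the choice of which spelling is the representative are all assumptions made for illustration; the specification only says that such rules are held in the topic word rules.

```python
# Variant spelling -> representative spelling (the representative is chosen arbitrarily here).
VARIANTS = {
    "イヌ": "犬",
    "沈澱": "沈殿",
    "インターフェース": "インタフェース",
    "大日本帝國": "大日本帝国",
    "打合せ": "打ち合わせ",
    "打ち合せ": "打ち合わせ",
    "打合わせ": "打ち合わせ",
}


def normalize(words):
    """Replace every known variant by its representative spelling."""
    return [VARIANTS.get(word, word) for word in words]


print(normalize(["イヌ", "散歩", "打合せ"]))  # ['犬', '散歩', '打ち合わせ']
```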

Step 603) Topic word candidates are extracted from the morphological analysis result. Which phrases are suitable as topic words can differ with the texts and field being handled and with the user's preferences. In this embodiment, for the reasons given next, three kinds of noun phrase are extracted as topic words, and the rules for them are described in the topic word rules 22.

First, content that many people are writing about and that shows change over time (concentrated in a short period, or discussed over a long one) is suitable as a topic word. Among such words, those with enough impact to catch the user's interest, and those from which the content can be pictured at a glance, are the most useful.

High-impact words include proper nouns and new words. New words are often not registered in the morphological analysis dictionary and are therefore treated as a sequence of unknown words. Accordingly, a sequence of unknown katakana words is treated here as a proper noun, and a sequence of unknown alphabetic words as a noun.

Words from which the content can be pictured at a glance include proper nouns, which are concrete in themselves; sequences of nouns, which become more concrete as they are chained; and nouns joined by the case particle 「の」. A phrase that is too long cannot be taken in at a glance, because the user has to read it through, so it is unsuitable. A length at or below a fixed limit is therefore desirable.

From the above, this embodiment adopts as topic words noun phrases of limited length that are a proper noun, a sequence of nouns, or noun + case particle 「の」 + noun. Their regular expressions are shown below. 「?」 means zero or one occurrence of the preceding item, 「|」 means OR between the items on either side of it, and 「{A,B}」 means at least A and at most B repetitions of the preceding item, where A and B are positive integers. The symbols are a: case particle 「の」, n: noun, N: proper noun, p: noun prefix, s: noun suffix.

1. (p?(n|N)s?){2,4}
2. (p?(n|N)s?){1,3}a(p?(n|N)s?){1,3}
3. N

The morphological analysis results are joined together, and the longest match of a word sequence fitting any of the above is taken as a topic word candidate. This yields, for example, 「テロ組織の犯行声明」 (a terrorist organization's claim of responsibility), 「シフォンケーキ」 (chiffon cake) and 「アメリカ大統領選挙」 (US presidential election). Beyond these, noun phrases combining an adjective or an adjectival noun with a noun can also be used to extract vocabulary that readily evokes a scene.

The above is what the topic word rules 22 describe.
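
The three patterns can be tried out on a sequence of part-of-speech tags. The sketch below assumes that a morphological analyzer has already produced (surface, tag) pairs using the one-letter tags a, n, N, p and s from the text, plus a hypothetical tag x for everything else. The function name and the scan strategy (take the longest of the three matches at each position, then continue after it) are illustrative choices, not the specification's procedure.

```python
import re

UNIT = r"(?:p?[nN]s?)"  # optional prefix, noun or proper noun, optional suffix
PATTERNS = [
    re.compile(UNIT + r"{2,4}"),                    # 1. sequence of nouns
    re.compile(UNIT + r"{1,3}a" + UNIT + r"{1,3}"),  # 2. noun + "の" + noun
    re.compile(r"N"),                               # 3. proper noun
]


def extract_candidates(tokens):
    """tokens: list of (surface, tag); returns the matched noun phrases."""
    tags = "".join(tag for _, tag in tokens)  # one character per token
    found, pos = [], 0
    while pos < len(tags):
        lengths = [m.end() - pos for p in PATTERNS if (m := p.match(tags, pos))]
        if not lengths:
            pos += 1
            continue
        length = max(lengths)  # prefer the longest of the three patterns
        found.append("".join(surface for surface, _ in tokens[pos:pos + length]))
        pos += length
    return found
```

Because each token contributes exactly one tag character, positions in the tag string map directly to token indices. A lone common noun matches none of the patterns and is skipped.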

Step 604) Topic word candidates that are wholly contained in the keyword, that is, candidates for which "keyword" ⊇ "topic word candidate" holds, are removed from the candidates. For example, when the keyword is 「甲子園球場」 (Koshien Stadium) and the candidates 「甲子園球場」, 「甲子園」 and 「球場」 are obtained, all three are contained in the keyword as character strings and can therefore be removed.
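
As a character-string operation, this containment check is a plain substring test. A minimal sketch, with the hypothetical name `drop_contained`:

```python
def drop_contained(candidates, keyword):
    """Remove candidates that occur inside the keyword as a substring."""
    return [c for c in candidates if c not in keyword]


print(drop_contained(["甲子園球場", "甲子園", "球場", "阪神"], "甲子園球場"))  # ['阪神']
```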

Step 605) The topic word candidates extracted above are stored in the topic word DB 21. Each candidate is stored together with the information on the document from which it was extracted. When a candidate extracted from one document is identical to a candidate from another document, the two are handled as a single candidate. The topic word DB 21 therefore contains no duplicate topic words across the whole set of documents acquired in one run.

The above processing is repeated for every document subject to topic word extraction, extracting topic word candidates for each document.
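
The stored relation between candidates and documents behaves like an inverted index: one entry per distinct candidate, pointing at every document it came from. In the sketch below an in-memory dictionary stands in for the topic word DB 21, and the names are hypothetical.

```python
from collections import defaultdict


def build_topic_index(doc_candidates):
    """doc_candidates: document id -> list of candidates extracted from it.

    Returns candidate -> set of document ids, so identical candidates
    from different documents collapse into one entry.
    """
    index = defaultdict(set)
    for doc_id, candidates in doc_candidates.items():
        for candidate in candidates:
            index[candidate].add(doc_id)
    return dict(index)
```

For instance, if documents A and C both yield 「アジアカップ」, the index holds a single 「アジアカップ」 entry linked to both.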

(3) Topic word aggregation processing unit 520:
FIG. 7 is a flowchart of the topic word selection process in the first embodiment of the present invention.

Step 701) The topic word aggregation processing unit 520 acquires the topic word candidates from the topic word DB 21.

Step 702) When the following two forms both occur among the topic word candidates, they are merged into one: 「DaE」 and 「DE」. Here D and E are each (p?(n|N)s?){1,3} and a is the case particle 「の」. The regular expression for D and E is as explained for the topic word candidate extraction processing unit in (2) above. Which of the two forms is kept is not prescribed, but this embodiment merges into DaE. The links between a candidate and its documents are carried over to the candidate it is merged into. For example, when 「プログラミング言語」 and 「プログラミングの言語」 both appear as candidates, they can be merged into 「プログラミングの言語」. The document data containing 「プログラミング言語」 is then carried over to 「プログラミングの言語」 (that is, treated as document data containing 「プログラミングの言語」).
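
The DaE/DE merge can be sketched on surface strings. As a simplification, the sketch treats a single 「の」 inside a candidate as the case particle instead of consulting the part-of-speech tags, which a real implementation would need to do. The names `merge_no_variants` and `absorbed` are hypothetical.

```python
def merge_no_variants(index):
    """index: candidate -> set of document ids.

    Merges each "DE" candidate into its "DaE" counterpart and returns
    (merged index, {kept candidate: [absorbed candidates]}).
    """
    merged = {cand: set(docs) for cand, docs in index.items()}
    absorbed = {}
    for cand in list(merged):
        pos = cand.find("の")
        # Require exactly one "の" with non-empty D and E on either side.
        if cand.count("の") != 1 or not 0 < pos < len(cand) - 1:
            continue
        short = cand[:pos] + cand[pos + 1:]  # the "DE" form
        if short in merged:
            merged[cand] |= merged.pop(short)  # documents are carried over
            absorbed.setdefault(cand, []).append(short)
    return merged, absorbed
```

The `absorbed` map plays the role of the attached information that records which candidates were merged away.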

Step 703) If an acquired topic word candidate contains the keyword as part of it, the following processing is performed.

1. Let the topic word candidates be
[1] KD
[2] KaD
[3] D
When candidates [1] and [2] coexist, they are merged into [2].

When candidates coexist as [1][3], [2][3] or [1][2][3], they are merged into [3].

2. Let the topic word candidates be
[1] DK
[2] DaK
[3] D
When candidates [1] and [2] coexist, they are merged into [2].

Here,
a: the case particle 「の」
K: the keyword
D: (p?(n|N)s?){1,3}

For example, when the keyword is 「犬」 (dog) and 「犬のトリーミング」 (dog trimming) and 「トリーミング」 (trimming) appear as topic word candidates, they can be merged into 「犬のトリーミング」. The links between a candidate and its documents are again carried over to the candidate it is merged into.

Step 704) Each candidate that was merged away is associated, as attached information, with the candidate it was merged into and stored in the topic word DB 21. The related document information is stored with it.

The above is carried out for all topic word candidates, which completes the aggregation of the topic word candidates.

(4) Topic word score calculation processing unit 540:
FIG. 8 is a flowchart of the topic word score calculation process and the document classification process in the first embodiment of the present invention. Steps 801 to 804 are performed by the topic word score calculation processing unit 540, and steps 805 to 808 by the document classification processing unit 550 described later.

Step 801) For a given topic word candidate in the topic word DB 21, the topic word score calculation processing unit 540 acquires from the document DB 20 the information on the documents that contain the character string of either that candidate or any candidate held as its attached information.

Step 802) When the document was acquired from search results and has update date information, the document topic word score for that document is calculated from the update date information and the search result output rank information. One method, for example, adds a date score and a search result score as follows. The date score may be, for example, 1 if the document is within 24 hours of the time it was acquired, and after that 1 divided by the number of days elapsed. Search result output rank information, or a scoring method like those used in search engines, may also be used.

The search result score may be, for example, 1 divided by the output rank.

Step 803) When the document was acquired from search results but has no date information, the processing of step 802 is skipped and the document topic word score is calculated from the search result output rank information alone, for example as 1 divided by the output rank.
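
Steps 802 and 803 can be written down as a small scoring function. The sketch below uses the example formulas from the text and assumes that the elapsed time is measured from the document's update date to the time it was fetched, in fractional days. The function and parameter names are hypothetical.

```python
from datetime import datetime


def date_score(updated: datetime, fetched: datetime) -> float:
    """1 within 24 hours, otherwise 1 divided by the elapsed days (assumed fractional)."""
    days = (fetched - updated).total_seconds() / 86400
    return 1.0 if days <= 1 else 1.0 / days


def rank_score(rank: int) -> float:
    """1 divided by the search result output rank (rank starts at 1)."""
    return 1.0 / rank


def document_topic_score(rank, updated=None, fetched=None):
    """Date score plus rank score, or the rank score alone when no date is known."""
    if updated is None or fetched is None:
        return rank_score(rank)  # step 803: no date information
    return date_score(updated, fetched) + rank_score(rank)  # step 802
```

A document updated 12 hours before fetching and ranked second would score 1 + 1/2 under these assumptions.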

Step 804) The document topic word scores of all documents containing the character string of the topic word candidate, or of any merged candidate held as its attached information, are summed, and the total is the topic word score. FIG. 9 shows example documents and example output in the first embodiment of the present invention. In FIG. 9, "Document A" has the topic word candidates 「アジアカップ」 (Asia Cup) and 「日本」 (Japan), "Document B" has 「ワールドカップ予選」 (World Cup qualifier), 「日本」 and 「韓国」 (Korea), and "Document C" has 「アジアカップ」. If the document topic word scores are as shown in FIG. 9(a), the topic word scores are as shown in FIG. 9(b). The topic word score is then stored in the topic word DB 21 as attached information of the topic word candidate.
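
The summation of step 804 can be illustrated with the three documents named above. The per-document scores in the sketch are made-up values chosen for the illustration; they are not the values of FIG. 9, which are not reproduced in the text. The function name is hypothetical.

```python
def topic_word_scores(index, doc_scores):
    """index: candidate -> set of document ids; doc_scores: document id -> score."""
    return {word: sum(doc_scores[d] for d in docs) for word, docs in index.items()}


# Hypothetical document topic word scores, for illustration only.
doc_scores = {"A": 1.5, "B": 1.0, "C": 0.5}
index = {
    "アジアカップ": {"A", "C"},
    "日本": {"A", "B"},
    "ワールドカップ予選": {"B"},
    "韓国": {"B"},
}
print(topic_word_scores(index, doc_scores))
```

With these values 「日本」 receives 2.5 and 「アジアカップ」 2.0, while the two candidates found only in Document B receive 1.0 each.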

(5) Document classification processing unit 550:
The operation of the document classification processing unit 550 is described using the flowchart of FIG. 8.

Step 805) When the documents were acquired from search results, the document classification processing unit 550 has the processing of steps 801 to 804 above carried out for all topic word candidates. The candidates are then sorted in descending order of topic word score.

Step 806) Candidates whose topic word score is below the constant Y, and for which the number of documents containing the character string of the candidate or of a candidate merged into it (attached information) is below the constant M, are removed from the candidates; the rest are set as topic words. Y and M are positive integers.
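
Steps 805 and 806 amount to a sort followed by a filter. A minimal sketch, in which the function name and the sample values are assumptions:

```python
def select_topic_words(scores, index, Y, M):
    """Sort candidates by score (descending) and drop the weak ones.

    A candidate is dropped only when BOTH conditions hold:
    its score is below Y and it is linked to fewer than M documents.
    """
    ranked = sorted(scores, key=lambda word: scores[word], reverse=True)
    return [w for w in ranked if not (scores[w] < Y and len(index[w]) < M)]


scores = {"アジアカップ": 2.0, "日本": 2.5, "韓国": 1.0}
index = {"アジアカップ": {"A", "C"}, "日本": {"A", "B"}, "韓国": {"B"}}
print(select_topic_words(scores, index, Y=2, M=2))  # ['日本', 'アジアカップ']
```

Note that a low-scoring candidate still survives if it is linked to at least M documents, because the two conditions are combined with "and".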

Step 807) A document that contains no character string of any topic word, or of any candidate merged into one (attached information), is stored in the topic word DB 21 as 「該当なし」 (not applicable).

Step 808) From the document DB 20 and the topic word DB 21, the topic words and their related documents are output in list form, in descending order of topic word score or of number of documents. An example of the output is shown in FIG. 9(c).
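
The list output of steps 807 and 808 could be assembled as below. The sketch assumes the topic words arrive already sorted and that unmatched documents are collected under a final 「該当なし」 entry; the function name and the data layout are hypothetical.

```python
def classify(documents, topic_words, index):
    """documents: all document ids; topic_words: sorted topic words;
    index: topic word -> set of document ids.

    Returns a list of (topic word, [document ids]) pairs, followed by a
    ("該当なし", [...]) pair for documents that match no topic word.
    """
    result = [(word, sorted(index[word])) for word in topic_words]
    covered = set().union(*(index[word] for word in topic_words))
    rest = [doc for doc in documents if doc not in covered]
    if rest:
        result.append(("該当なし", rest))
    return result
```

A document linked to several topic words appears under each of them, which matches the way one document can carry several topic word candidates.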

This information can also be displayed in a map-like form instead of a list. Various display methods, two-dimensional, three-dimensional and so on, can be applied.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

[Second Embodiment]
The second embodiment of the present invention aims to acquire the latest RSS from news sites or Weblog portals that provide it in RSS format, extract topic words from the information obtained, and cluster the content by topic word. Because information provided in RSS format is updated continually, it can be refreshed periodically so that newly arrived information is acquired as it appears and the newly arrived topic words and their content are provided to the user.

Some sites return search results in RSS format when a search keyword is sent to a certain URI. By accessing such a site periodically, information related to the keyword of interest can be obtained. Some sites also let a keyword be registered and then continually provide information related to it in RSS format. New arrivals can therefore be followed by the keyword of interest; for example, it becomes possible to see what is being talked about within 「サッカー」 (soccer). Even for information that cannot be acquired by keyword, using the present invention together with a search engine makes it possible to acquire the continually updated information in RSS format, search the content information by keyword, and use only the information of interest as the input source.
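
Turning such a feed into the fields used by the earlier steps might look like the sketch below. It assumes a plain RSS 2.0 layout (`item` elements with `title`, `link`, `description` and `pubDate`) and treats the item order as the output rank; RSS 1.0 feeds, which use XML namespaces, are not handled. The function name and the returned field names are hypothetical.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_rss(xml_text):
    """Return one dict per RSS item with rank, title, url, text and updated."""
    root = ET.fromstring(xml_text)
    items = []
    for rank, item in enumerate(root.iter("item"), start=1):
        pub = item.findtext("pubDate")
        items.append({
            "rank": rank,  # position in the feed, used as the output rank
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "text": item.findtext("description", default=""),
            # pubDate uses the RFC 822 date format; None when it is missing.
            "updated": parsedate_to_datetime(pub) if pub else None,
        })
    return items
```

Fetching the feed periodically and comparing the item URLs with those already stored would give the newly arrived entries.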

The processing is described in detail below.

FIG. 10 is a configuration diagram of the topic-word-based information classification device according to the second embodiment of the present invention.

The device shown in FIG. 10 comprises a computer 10, a document database (document DB) 20 connected to the computer 10 via a network 40, a topic word database (topic word DB) 21, a topic word rule storage unit 620 that stores the topic word rules 22, a topic word aggregation rule storage unit 630 that stores the topic word aggregation rules 23, and an NG word list storage unit 640 that stores the NG word list 24.

The computer 10 comprises memory such as RAM, ROM, and a magnetic disk; a CPU; a display unit 11 consisting of a display; and an instruction input unit 12 consisting of a mouse, a keyboard, and the like. It includes a data collection processing unit 500, a topic word candidate extraction processing unit 510, a topic word aggregation processing unit 520, a time-lapse inspection processing unit 530, a topic word score calculation processing unit 540, and a document classification processing unit 550, each realized by a software program executed by the CPU.

The document DB 20 stores, for each content item subject to topic word extraction, its URL (URI), its title, text representing the body (such as a summary), and meta information such as the update date and time.

The topic word DB 21 stores: the topic word candidates extracted from the content, together with the number of content items related to each candidate and information on those items; the topic words selected by the topic word selection processing, together with the number of content items related to each topic word; the time at which the content information was acquired, together with the information on that content; and the topic word candidates that were merged into other candidates during the topic word selection processing.

The topic word rules 22 are stored in the topic word rule storage unit 620 and describe the conditions for extracting topic words. These rules can be freely added to and changed.

The topic word aggregation rules 23 are stored in the topic word aggregation rule storage unit 630 and describe the conditions for merging topic words that can be taken to have the same meaning. These rules can also be freely added to and changed.

The NG word list 24 is a list of phrases that are unsuitable as topic words. For simplicity, processing can also be performed without preparing the NG word list 24.

In the topic word extraction system configured in this way, the data collection processing unit 500, the topic word candidate extraction processing unit 510, the time-lapse inspection processing unit 530, the topic word aggregation processing unit 520, and the topic word score calculation processing unit 540 realize the present invention by executing the processing described below.

The processing is described below.

(1) Data collection processing unit 500:
FIG. 11 shows a flowchart of the data collection processing in the second embodiment of the present invention.

Step 1001) The data collection processing unit 500 first acquires, on the basis of the specified keyword, a group of content items described in RSS format from the URI of an RSS-providing site registered in advance. For example, in the case of gooblog search, simply sending an address such as
http://blog.goo.ne.jp/search/search.php?status&tg=all&st=time&dc=50&dp=all&bu=&ts=all&MT="search key"&da=all&rss=1&fr=1
returns, in RSS format, 50 search results for which "search key" was used as the search keyword. The search key must, however, be URL-encoded (escaped): "サッカー" (soccer) becomes "%A5%B5%A5%C3%A5%AB%A1%BC".
A plurality of sites (URIs) can also be registered in advance. When a plurality of sites (URIs) are registered, this is realized by repeating the flow of the data collection processing unit 500 once per registered site.
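The escaping in step 1001 can be sketched as follows. The escaped form given in the text for "サッカー" is what percent-encoding the EUC-JP bytes of the word yields, so this sketch assumes EUC-JP; that the site actually expects this encoding is inferred only from that example. The names `REGISTERED_SITES`, `encode_search_key`, and `build_request_urls` are hypothetical, and the block only builds the addresses without sending any request.

```python
from urllib.parse import quote

# Address templates for the registered RSS-providing sites (hypothetical list).
# The single entry is the address from the text, with the search key as {key}.
REGISTERED_SITES = [
    "http://blog.goo.ne.jp/search/search.php?status&tg=all&st=time&dc=50"
    "&dp=all&bu=&ts=all&MT={key}&da=all&rss=1&fr=1",
]


def encode_search_key(keyword, encoding="euc_jp"):
    """Percent-encode the keyword from its bytes in the given encoding."""
    return quote(keyword, safe="", encoding=encoding)


def build_request_urls(keyword, sites=REGISTERED_SITES):
    """Return one request address per registered site (one pass of the flow each)."""
    key = encode_search_key(keyword)
    return [site.format(key=key) for site in sites]


if __name__ == "__main__":
    for url in build_request_urls("サッカー"):
        print(url)
```

Keeping the sites as a list of templates mirrors the remark that the flow is simply repeated once per registered URI.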

FIG. 12 shows an example of an RSS description in the second embodiment of the present invention.

Step 1002) In the acquired information, the information enclosed between the <item> and </item> tags is regarded as one content item. From within it, the <title> (title), <link> (link information), and <description> (body, such as a summary of the content) are acquired.
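As a minimal sketch of step 1002, the following reads each `<item>` element and pulls out its title, link, and description using Python's standard `xml.etree.ElementTree`. The sample feed is invented for illustration and is not the RSS of FIG. 12. It assumes un-namespaced element names; feeds in the RDF-based RSS 1.0 format use namespaced tags and would need the namespace added to each tag name.

```python
import xml.etree.ElementTree as ET

# Invented sample feed; the second item deliberately has no <description>.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>search results</title>
    <item>
      <title>Match report</title>
      <link>http://example.com/a</link>
      <description>Short summary of the match.</description>
    </item>
    <item>
      <title>No summary here</title>
      <link>http://example.com/b</link>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item>, with empty strings for missing fields."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_RSS):
        print(entry)
```

An empty `description` is returned as an empty string so that a later step can detect it and fall back to fetching the body from the link.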

Step 1003) The "description" may or may not be present. Even when present, it may be too short a character string to be used for analysis; a string shorter than one sentence, for example, is unsuitable for extracting topic words. When the body cannot be obtained in this way, the body of the content is acquired directly on the basis of the link information. There are various ways of acquiring the body. For an html file, for example, the portion enclosed in <p> tags that contains the longest text can be regarded as the body and acquired. When a long text is obtained, the text before and after the search keyword within that document is acquired and used as the body. For example, 256 characters before and after the keyword, centered on the keyword, may be acquired, or the sentence containing the keyword together with the one sentence before and the one sentence after it may be acquired. When the search keyword appears more than once in the text, all of the text in the vicinity of those occurrences is gathered together and treated as the body. In every case, the result is regarded as the body describing the content and is handled as such below.
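The "256 characters before and after the keyword" option can be sketched as below. The function name `keyword_windows` and the decision to concatenate the windows without a separator are assumptions of this sketch; the text only says that the passages near every occurrence are gathered together as the body.

```python
def keyword_windows(text, keyword, radius=256):
    """Collect the text within `radius` characters of each keyword occurrence.

    Windows that overlap or touch are merged so no character is repeated.
    """
    if not keyword:
        return ""
    spans = []
    start = text.find(keyword)
    while start != -1:
        lo = max(0, start - radius)
        hi = min(len(text), start + len(keyword) + radius)
        if spans and lo <= spans[-1][1]:
            # Overlaps or touches the previous window: extend that window.
            spans[-1] = (spans[-1][0], max(spans[-1][1], hi))
        else:
            spans.append((lo, hi))
        start = text.find(keyword, start + 1)
    return "".join(text[lo:hi] for lo, hi in spans)


if __name__ == "__main__":
    sample = "a" * 10 + "KEY" + "b" * 10 + "KEY" + "c" * 10
    print(keyword_windows(sample, "KEY", radius=3))
```

Merging overlapping windows keeps a passage that mentions the keyword several times from being duplicated in the extracted body.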

Step 1004) The content information obtained in the above steps, that is, the title, the link information, and the body, together with the RSS acquisition time, is treated as one set for the content item and stored in the document DB 20. An RSS feed often describes a plurality of content items, so this information is acquired for each content item and stored in the document DB 20.
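A minimal sketch of step 1004, assuming the document DB is a single SQLite table. The table name, column names, and the ISO-format timestamp string are all hypothetical choices made for illustration; the text does not specify a schema.

```python
import sqlite3


def open_document_db(path=":memory:"):
    """Open (or create) a document DB with a hypothetical one-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " link TEXT NOT NULL,"
        " title TEXT NOT NULL,"
        " body TEXT NOT NULL,"
        " fetched_at TEXT NOT NULL)"
    )
    return conn


def store_items(conn, items, fetched_at):
    """Store every content item of one RSS fetch with the same acquisition time."""
    rows = [(it["link"], it["title"], it["body"], fetched_at) for it in items]
    conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    db = open_document_db()
    count = store_items(
        db,
        [{"link": "http://example.com/a", "title": "Match report", "body": "Text."}],
        "2004-11-08T09:00:00",
    )
    print(count)
```

Stamping every item of a fetch with the same acquisition time makes it straightforward to select "the latest RSS" later by its timestamp.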

(2) Topic word candidate extraction processing unit 510:
FIG. 13 shows a flowchart of the topic word candidate extraction processing in the second embodiment of the present invention.

Step 1101) The topic word candidate extraction processing unit 510 first retrieves from the document DB 20 the body and title of the content acquired from the latest RSS. Together these are treated as the document for that content. The body alone may also be used as the document.

Step 1102) Next, morphological analysis is performed on the document.

Step 1103) Topic word candidates are extracted from the morphological analysis results. Which phrases are suitable as topic words may differ depending on the text and field being handled and on the user's preferences. In the present embodiment, for the reasons given below, three types of noun phrase are extracted as topic words, and the rules for doing so are described in the topic word rules 22.

First, a phrase is suitable as a topic word when many people are writing about it and when its use changes over time (for example, it is concentrated in a short period or is discussed over a long period). Among such phrases, those with a strong impact that catches the user's interest, and those from which the content can be pictured at a glance, have higher utility value.

Words with a strong impact include proper nouns and new words. New words are often not registered in the dictionary used for morphological analysis and are therefore treated as runs of unknown words. Here, therefore, a run of unknown katakana words is treated as a proper noun, and a run of unknown alphabetic words is treated as a noun.

Words from which the content can be pictured at a glance include proper nouns, which are concrete in themselves; sequences of nouns, which become more concrete when strung together; and phrases in which nouns are joined by the case particle "の" (no). For a phrase to be grasped at a glance, it must also not be too long, since an overly long phrase forces the user to read it through. It is therefore desirable that the phrase be no longer than a fixed length.

From the above, among the expressions that could be treated as topic words, the present embodiment adopts noun phrases of no more than a fixed length consisting of a proper noun, a sequence of nouns, or noun + case particle "の" (no) + noun. The regular expressions are shown below. "?" means zero or one occurrence of the immediately preceding item, "|" means the OR of the items before and after the symbol, and "{A,B}" means at least A and at most B repetitions of the immediately preceding item, where A and B are positive integers. The symbols are a: case particle "の" (no), n: noun, N: proper noun, p: noun prefix, s: noun suffix.

1. (p?(n|N)s?){2,4}
2. (p?(n|N)s?){1,3}a(p?(n|N)s?){1,3}
3. N
The morphological analysis results are concatenated, and the longest match of a word sequence satisfying any of the above is acquired as a topic word candidate. Besides these, noun phrases that combine an adjective with a noun, or an adjectival noun (na-adjective) with a noun, can also be used to extract vocabulary from which a scene is easy to picture.
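The three rules can be tried out with an ordinary regular-expression engine by writing the part-of-speech sequence as a string of one-letter tags. In the sketch below the morphological analyzer is not included: the input is assumed to be a list of (surface, tag) pairs already produced by one, using the letters a, n, N, p, s from the text plus a hypothetical "x" for every other part of speech. Scanning left to right and taking the longest match at each position is one interpretation of "the longest match".

```python
import re

# One noun unit: optional prefix, a noun or proper noun, optional suffix.
UNIT = r"(?:p?[nN]s?)"
RULES = [
    re.compile(UNIT + r"{2,4}"),                     # rule 1: sequence of nouns
    re.compile(UNIT + r"{1,3}a" + UNIT + r"{1,3}"),  # rule 2: noun + "no" + noun
    re.compile(r"N"),                                # rule 3: single proper noun
]


def extract_candidates(morphemes):
    """Return topic word candidates from (surface, one-letter tag) pairs."""
    tags = "".join(tag for _, tag in morphemes)
    candidates = []
    i = 0
    while i < len(tags):
        best_end = i
        for rule in RULES:
            m = rule.match(tags, i)
            if m and m.end() > best_end:
                best_end = m.end()  # keep the longest match among the rules
        if best_end > i:
            candidates.append("".join(s for s, _ in morphemes[i:best_end]))
            i = best_end
        else:
            i += 1
    return candidates


if __name__ == "__main__":
    sentence = [("日本", "N"), ("代表", "n"), ("の", "a"),
                ("監督", "n"), ("が", "x"), ("辞任", "n")]
    print(extract_candidates(sentence))
```

In this sample, rule 2 wins at the first position because it covers four morphemes, while the trailing lone common noun matches none of the rules and is skipped.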

Step 1104) Topic word candidates that are wholly contained in the keyword, that is, words for which "keyword" ⊇ "topic word candidate" holds, are removed from the topic word candidates.

Step 1105) The topic word candidates extracted above are stored in the topic word DB 21. Each candidate is stored together with the information on the documents from which it was extracted. When a candidate extracted from one document is identical to a candidate extracted from another document, the two are handled as a single topic word candidate. Consequently, the topic word DB 21 contains no duplicate topic words across the whole set of documents acquired at one time.
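Steps 1104 and 1105 can be sketched together as building an index from each distinct candidate to the documents it came from. The function name, the document identifiers, and the reading of "wholly contained in the keyword" as a plain substring test are assumptions of this sketch.

```python
from collections import defaultdict


def build_candidate_index(doc_candidates, keyword):
    """Map each distinct candidate to the set of documents it was found in.

    doc_candidates: mapping of document id -> list of candidate strings.
    """
    index = defaultdict(set)
    for doc_id, candidates in doc_candidates.items():
        for cand in candidates:
            if cand in keyword:
                # Step 1104: the candidate is wholly contained in the keyword.
                continue
            # Step 1105: identical candidates from different documents share one entry.
            index[cand].add(doc_id)
    return dict(index)


if __name__ == "__main__":
    found = {"d1": ["サッカー", "日本代表"], "d2": ["日本代表", "監督"]}
    print(build_candidate_index(found, "サッカー"))
```

Using the candidate string as the dictionary key is what guarantees that no duplicate entries exist for one fetch, and the size of each set gives the number of related content items.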

(3) Topic word aggregation processing unit 520:
FIG. 14 is a flowchart of the topic word aggregation processing in the second embodiment of the present invention.

In the flowchart shown in the figure, steps 1201 to 1203 and step 1205 are the same as the operations of steps 701 to 704 in FIG. 7 of the first embodiment described above.

In the present embodiment, when an NG word list 24 exists, the topic word aggregation processing unit 520 checks the candidates against the list and deletes any matching topic word candidate from the candidates (step 1204).

(4) Time-lapse inspection processing unit 530:
FIG. 15 shows a flowchart of the time-lapse inspection processing in the second embodiment of the present invention.

Step 1301) The time-lapse inspection processing unit 530 acquires from the topic word DB 21 all topic words, together with the times at which their content information was acquired, for the period from time TP to time TN, where TN is the time at which the latest content information was acquired and TP is the time a constant T (a positive value representing time) before TN.

Step 1302) The topic words from time TP to time TN (including the topic word candidates held as auxiliary information for those topic words) are compared with the topic word candidates acquired at TN. If there is a topic word or topic word candidate that has continued to be extracted without interruption as time has passed, that phrase is deleted from the topic word candidates of TN. When an NG word list 24 exists, that topic word, or the topic word candidate held as its auxiliary information, is also added to the list.

This processing guards against the risk of extracting words that are unsuitable as topic words, for example when information unrelated to the body is misrecognized as the body and the same text is extracted every time.
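The time-lapse inspection can be sketched as follows, under the interpretation that "continues to be extracted without interruption" means the phrase appears at every acquisition time between TP and TN. The data layout (a dict from acquisition time to the set of phrases extracted then), the function name, and the rule that nothing is removed when only one acquisition falls in the window are assumptions of this sketch.

```python
def time_lapse_filter(history, tn, t_window, ng_words=None):
    """Drop from the candidates of time `tn` every phrase seen at all times in [tn - t_window, tn].

    history: mapping of acquisition time (a number) -> set of extracted phrases.
    ng_words: optional set that receives the dropped phrases.
    """
    tp = tn - t_window
    snapshots = [set(words) for t, words in history.items() if tp <= t <= tn]
    current = set(history[tn])
    if len(snapshots) < 2:
        # Only the latest acquisition is in the window: nothing to compare against.
        return current
    persistent = set.intersection(*snapshots)
    if ng_words is not None:
        ng_words.update(persistent)
    return current - persistent


if __name__ == "__main__":
    log = {
        1: {"copyright", "W杯"},
        2: {"copyright", "移籍"},
        3: {"copyright", "監督"},
    }
    ng = set()
    print(time_lapse_filter(log, tn=3, t_window=2, ng_words=ng), ng)
```

Here "copyright" stands in for page boilerplate misread as body text: it appears at every acquisition, so it is removed from the latest candidates and recorded in the NG word set.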

(5) Topic word score calculation processing unit 540 and document classification processing unit 550:
These units perform the same processing as the topic word score calculation processing unit 540 and the document classification processing unit 550 in the first embodiment, so their description is omitted.

The information finally obtained can also be displayed in a map-like form rather than as a list. Various display methods, such as two-dimensional or three-dimensional display, can be applied. In addition, when documents are acquired repeatedly at fixed intervals, the topic words are stored as a time series, so the transition of the topic words can be observed. This is made possible by acquiring the topic words from the topic word DB 21 for each time at which RSS was acquired and outputting them.

The operation of each component of the information classification devices of the first and second embodiments described above can be built as a program, which can be installed on a computer used as the information classification device or distributed via a network.

The program so built can also be stored on a hard disk device connected to the computer used as the information classification device, or on a portable storage medium such as a flexible disk or CD-ROM.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to techniques that automatically extract the words currently being talked about from information that is updated daily, such as news and blogs, and classify that information.

FIG. 1 is a diagram explaining the principle of the present invention.
FIG. 2 is a diagram showing the principle configuration of the present invention.
FIG. 3 is a diagram for explaining the outline of the present invention.
FIG. 4 is a configuration diagram of the topic-word-based information classification device in the first embodiment of the present invention.
FIG. 5 is a flowchart of the data collection processing in the first embodiment of the present invention.
FIG. 6 is a flowchart of the topic word candidate extraction processing in the first embodiment of the present invention.
FIG. 7 is a flowchart of the topic word aggregation processing in the first embodiment of the present invention.
FIG. 8 is a flowchart of the topic word score calculation processing and the document classification processing in the first embodiment of the present invention.
FIG. 9 shows an example document and an example output in the first embodiment of the present invention.
FIG. 10 is a configuration diagram of the topic-word-based information classification device in the second embodiment of the present invention.
FIG. 11 is a flowchart of the data collection processing in the second embodiment of the present invention.
FIG. 12 is an example of an RSS description in the second embodiment of the present invention.
FIG. 13 is a flowchart of the topic word candidate extraction processing in the second embodiment of the present invention.
FIG. 14 is a flowchart of the topic word aggregation processing in the second embodiment of the present invention.
FIG. 15 is a flowchart of the time-lapse inspection processing in the second embodiment of the present invention.

Explanation of symbols

10 Computer
11 Display unit
12 Instruction input unit
20 Document DB
21 Topic word DB
22 Topic word rules
23 Topic word aggregation rules
24 NG word list
40 Network
300 Data collection means
310 Topic word candidate extraction means
320 Topic word aggregation means
330 Time-lapse inspection means
340 Topic word score calculation means
350 Document classification means
500 Data collection processing unit
510 Topic word candidate extraction processing unit
520 Topic word aggregation processing unit
530 Time-lapse inspection processing unit
540 Topic word score calculation processing unit
550 Document classification processing unit
620 Topic word rule storage means, topic word rule storage unit
630 Topic word aggregation rule storage means, topic word aggregation rule storage unit
640 NG word list storage means, NG word list storage unit

Claims

An information classification apparatus that extracts topic words from document data acquired on the basis of a keyword and classifies the document data by topic word, thereby performing classification that reflects features related to the keyword, the apparatus comprising:
topic word rule storage means for storing topic word rules that use combinations of parts of speech as conditions for extracting topic words;
topic word aggregation rule storage means for storing topic word aggregation rules, consisting of combinations of parts of speech, for aggregating topic words that can be taken to have the same meaning;
data collection means for using a specified keyword as a search keyword, acquiring from the search results the update date and time of the document data, the search result output rank, and the body text or text containing part of the body, or supplementarily collecting the body text on the basis of the publication location of the document data obtained from the search results, and storing these in a document database;
topic word candidate extraction means for reading out document data from the document database, extracting topic word candidates from the document data with reference to the topic word rules stored in the topic word rule storage means, and storing the candidates in a topic word database;
topic word aggregation means for aggregating the topic word candidates read out from the topic word database on the basis of the topic word aggregation rules stored in the topic word aggregation rule storage means;
topic word score calculating means for calculating, for each topic word candidate in the topic word database, a topic word score from the degree of relevance between the search keyword and the document data containing the topic word and from the update time of that document data; and
document classification means for narrowing the topic word candidates in the topic word database down to a predetermined number of candidates by topic word score, further narrowing them down to candidates for which the number of documents containing the character string of the candidate is at least a predetermined number, and classifying, for each topic word so selected, the document data in descending order of topic word score,
wherein the topic word score calculating means includes:
means for obtaining the topic word score of a topic word candidate by summing the document topic word scores obtained from the document data containing the character string of that topic word candidate; and
means for determining the document topic word score from the search result output rank of each item of document data containing the character string of the topic word candidate and from the update date and time of that document data obtained from the search results.

The information classification apparatus according to claim 1, wherein the topic word aggregation means includes means for extracting, as the topic word, a phrase that:
uses a combination of parts of speech satisfying a rule predefined by the topic word rules;
is not included in the search keyword;
is not present in a stored NG word list of phrases unsuitable as topic words; and
has been aggregated and selected as a single topic word, on the basis of the topic word aggregation rules, from phrases that can be taken to have the same meaning.

The information classification apparatus according to claim 1, further comprising time-lapse inspection means for excluding a phrase from the topic words when, during repeated execution of the topic-word-based information classification, that phrase continues to be extracted as a topic word throughout a fixed period T (a positive integer).

An information classification program for causing a computer to function as each of the means constituting the information classification apparatus according to any one of claims 1 to 3.

A computer-readable recording medium storing the information classification program according to claim 4.