JP4206961B2

JP4206961B2 - Topic extraction method, apparatus and program

Info

Publication number: JP4206961B2
Application number: JP2004136590A
Authority: JP
Inventors: 吉秀佐藤; 晴美川島; 伸治安部; 雅且大久保
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-04-30
Filing date: 2004-04-30
Publication date: 2009-01-14
Anticipated expiration: 2024-04-30
Also published as: JP2005316899A

Description

The present invention relates to a topic extraction method, apparatus, and program. In particular, it relates to a topic extraction method, apparatus, and program for automatically extracting, from the documents in each field, the phrases that have recently become topics, in a situation where documents containing new information, such as news articles classified by field, can be obtained one after another.

The information available from media such as newspapers and television increases daily, and the effect of the spread of the Internet is especially pronounced: information floods in and is quickly buried. Under these circumstances, recently updated or added information is likely to contain timely content such as current trends, matters of public interest, and newly arrived news. Therefore, collecting and analyzing many documents with recent creation times makes it possible to grasp recent trends and timely events.

Techniques for extracting words that represent topics from multiple pieces of document information include a method that uses contextual rules and linguistic knowledge (see, for example, Patent Document 1). This method uses a dictionary trained on the expressions used when the topic changes, and detects topic candidates while taking into account the relationship between the topic after the change and the topic before it.
Patent Document 1: JP-A-6-139276

However, although the above conventional method does not require domain knowledge, it requires building, before topic detection is performed, a dictionary that collects the expressions used at topic changes, such as "by the way", "next", and "now then". In addition, the dictionary must be rebuilt whenever the target language changes.

The present invention has been made in view of the above points. Its object is to provide a topic extraction method, apparatus, and program that analyze document data acquired one after another and extract words that represent topics, as well as important and highly timely words, without requiring any prior knowledge.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (claim 1) is a topic extraction method that analyzes a large number of documents having creation time information and, for the phrases in those documents, examines how each phrase's appearance frequency changes over time, thereby quantifying and outputting the strength of the phrase's topicality, the method comprising:
a phrase acquisition step (steps 1 and 2) in which phrase acquisition means acquires phrases having creation times within an analysis time range from an all-phrase storage database that records, as a set, the creation time of a document, an ID that uniquely identifies the document, and the phrases appearing in the document;
a phrase topic level calculation step (step 3) in which phrase topic level calculation means takes as the analysis time range the interval made up of the interval from the current time back to the time Tplus in the past (the positive interval) and the interval from the time Tplus before the current time back a further Tminus (the negative interval); assigns a positive value to the positive interval and a negative value to the negative interval, set so that the absolute value of the integral of the negative value over the negative interval is larger than the absolute value of the integral of the positive value over the positive interval; adds the positive value to a buffer as many times as the appearance frequency of a phrase in the positive interval, and adds the negative value to the buffer as many times as the appearance frequency of the phrase in the negative interval; and finally sums the accumulated positive and negative values stored in the buffer to calculate the topic level of the phrase; and
an output step (step 4) in which output means outputs the calculated topic level to output phrase storage means,
wherein, in the phrase topic level calculation step (step 3), the analysis time range is expanded when the number of documents having creation times within the analysis time range is less than a predetermined number.
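The calculation in step 3 can be written compactly. The following is only a sketch: it assumes weights that are constant over each interval (as in the filter of the embodiment), and n_+(w) and n_-(w) are illustrative names, not taken from the source, for the appearance frequency of a phrase w in the positive and negative intervals.

```latex
\[
  \mathrm{topic}(w) = W_{+}\, n_{+}(w) + W_{-}\, n_{-}(w),
  \qquad W_{+} > 0,\quad W_{-} < 0,
\]
\[
  S_{+} = W_{+}\, T_{\mathrm{plus}} \;<\; S_{-} = |W_{-}|\, T_{\mathrm{minus}} .
\]
```

Under these assumptions, a phrase that appears at the same rate throughout both intervals receives a negative total, because the negative area outweighs the positive one.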

FIG. 2 is a diagram showing the principle configuration of the present invention.

The present invention (claim 2) is a topic extraction apparatus that analyzes a large number of documents having creation time information and, for the phrases in those documents, examines how each phrase's appearance frequency changes over time, thereby quantifying and outputting the strength of the phrase's topicality, the apparatus comprising:
an all-phrase storage database 13 that records, as a set, the creation time of a document, an ID that uniquely identifies the document, and the phrases appearing in the document;
phrase acquisition means 151 for acquiring, from the all-phrase storage database 13, phrases having creation times within an analysis time range;
phrase topic level calculation means 152 that takes as the analysis time range the interval made up of the interval from the current time back to the time Tplus in the past (the positive interval) and the interval from the time Tplus before the current time back a further Tminus (the negative interval); assigns a positive value to the positive interval and a negative value to the negative interval, set so that the absolute value of the integral of the negative value over the negative interval is larger than the absolute value of the integral of the positive value over the positive interval; adds the positive value to a buffer as many times as the appearance frequency of a phrase in the positive interval, and adds the negative value to the buffer as many times as the appearance frequency of the phrase in the negative interval; and finally sums the accumulated positive and negative values stored in the buffer to calculate the topic level of the phrase; and
output means 153 for outputting the calculated topic level to output phrase storage means,
wherein the phrase topic level calculation means 152 includes means for expanding the analysis time range when the number of documents having creation times within the analysis time range is less than a predetermined number.

The present invention (claim 3) is a topic extraction program for causing a computer to function as each of the means constituting the topic extraction apparatus according to claim 2.

As described above, according to the present invention, a large number of highly timely documents that increase from moment to moment, such as news articles, can be collected and phrases with high topicality can be extracted automatically from the phrases in those documents. It is therefore possible to grasp recent trends and topics simply by surveying the phrases with high topicality, without reading through each article.

In addition, a rise in a current topic can be detected by applying a filter to the appearance frequency of a phrase (the number of documents in which the phrase appears), which increases and decreases over time. Measured from the present, the filter has a positive weight in the near past and a negative weight in the distant past. Making the filter's negative effect larger than its positive effect reduces topic extraction errors and captures larger topic changes.

Furthermore, even when using an information source in which the number of documents rises and falls with the time of day or the day of the week, topics that have persisted up to the present can be detected effectively.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 3 shows the configuration of a topic extraction apparatus according to an embodiment of the present invention.

The topic extraction apparatus 10 shown in the figure is connected to a document data buffer 11, which stores the document data that serves as input to the apparatus, and to an output phrase recording device 16, which records the phrases output by the apparatus 10 together with their topic levels.

The topic extraction apparatus 10 comprises a document analysis unit 12, an all-phrase storage database 13, a trigger issuing unit 14, and a phrase topic level calculation unit 15.

Newly created documents are each given a unique document ID and are input and recorded one after another in the document data buffer 11 together with their creation time information. It is desirable to target information sources in which documents containing new information are updated continually, such as articles published on Internet news sites. In that case, the update status of the document data on the site should be monitored, and each update time collected as the document creation time. The document data buffer 11 is a queue that temporarily stores input documents; the document data accumulated there waits to be sent to the document analysis unit 12.

The document analysis unit 12 takes the document data stored in the document data buffer 11 one document at a time and performs text analysis. It applies morphological analysis to the input text and splits it into parts of speech. At this point, consecutive nouns may be joined into a compound noun as needed, and the compound noun treated as a single noun. Because nouns (or compound nouns) are better suited as topic-representing phrases than verbs such as "walk" and "instruct" or adjectives such as "blue" and "high", the document analysis unit 12 extracts only nouns (or compound nouns) from the document. In the following description, nouns (or compound nouns) are collectively referred to as "phrases".
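The noun filtering and compound-noun joining described above can be sketched as follows. The source does not name a morphological analyzer, so this sketch assumes the text has already been split into (surface, part-of-speech) pairs; the function name `extract_phrases`, the tag string "noun", and the `sep` parameter are all illustrative assumptions.

```python
from typing import Iterable, List, Tuple


def extract_phrases(tagged: Iterable[Tuple[str, str]],
                    join_compounds: bool = True,
                    sep: str = " ") -> List[str]:
    """Keep only nouns; optionally join runs of consecutive nouns."""
    phrases: List[str] = []
    run: List[str] = []  # current run of consecutive nouns

    def flush() -> None:
        # Emit the pending run as one compound noun or as separate nouns.
        if not run:
            return
        if join_compounds:
            phrases.append(sep.join(run))
        else:
            phrases.extend(run)
        run.clear()

    for surface, pos in tagged:
        if pos == "noun":
            run.append(surface)
        else:
            flush()  # a non-noun token ends the current run
    flush()
    return phrases


# Hypothetical pre-tagged input (tags are assumptions, not analyzer output).
tokens = [("Self-Defense", "noun"), ("Forces", "noun"), ("dispatch", "noun"),
          ("was", "verb"), ("high", "adjective"), ("Tokyo", "noun")]
print(extract_phrases(tokens))
```

For Japanese text, where words are not separated by spaces, `sep=""` would be the natural choice.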

The all-phrase storage database 13 records, as a set, the creation time and document ID of each document obtained from the document analysis unit 12 and the phrases extracted from that document.

The trigger issuing unit 14 is a timer that generates triggers for performing topic extraction at regular intervals. At each preset time interval, it sends a command to start topic extraction to the phrase topic level calculation unit 15, which begins topic extraction on receiving the trigger.

The phrase topic level calculation unit 15 acquires from the all-phrase storage database 13 information on phrases whose creation times lie in the past relative to the current time. It acquires only phrases whose creation times fall within the topic extraction target period that it determines. At the same time, it takes into account the number of documents input to the document data buffer 11: when there are few documents, it expands the time range to be acquired (the topic extraction target period) until a certain number of documents is obtained.

It calculates a topic level for each of the phrases acquired here and outputs it to the output phrase recording device 16.

The topic level of a phrase w is a numerical value indicating how well suited the phrase w is as a phrase representing a matter that is currently a topic. The larger the topic level, the higher the topicality.

The phrase topic level calculation unit 15 excludes phrases that are in regular, everyday use as not representing topics, and gives high topic levels to phrases that appear intensively and frequently after a period of absence and to phrases that appear especially intensively within a short period. To provide this capability, it is preferably configured with the following functions.

The graph in FIG. 4 shows the filter used when the phrase topic level calculation unit 15 calculates the topic level of a phrase. Starting from the current time, the filter takes the positive value W+ from the present back to Tplus in the past (called the positive interval), and takes the negative value W− over the further Tminus before that (called the negative interval). Its characteristic feature is that the area S− of the negative interval is larger than the area S+ of the positive interval.
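A filter of this shape can be sketched as a step function of a document's age. The interval lengths of 8 and 24 hours appear later in the description; the weight W+ = 1 matches the worked example in step 208, but W− = -0.5 is an assumed value chosen only so that S− exceeds S+, and the treatment of the exact interval boundaries is also an assumption.

```python
def filter_weight(age_hours: float,
                  t_plus: float = 8.0, t_minus: float = 24.0,
                  w_plus: float = 1.0, w_minus: float = -0.5) -> float:
    """Weight for a document created `age_hours` before the current time."""
    if age_hours < 0:
        return 0.0                    # created in the future: ignored
    if age_hours <= t_plus:
        return w_plus                 # positive interval
    if age_hours <= t_plus + t_minus:
        return w_minus                # negative interval
    return 0.0                        # older than the analysis time range


def filter_areas(t_plus: float = 8.0, t_minus: float = 24.0,
                 w_plus: float = 1.0, w_minus: float = -0.5):
    """Return (S+, S-), the absolute areas of the two intervals."""
    return abs(w_plus) * t_plus, abs(w_minus) * t_minus


s_plus, s_minus = filter_areas()
print(s_plus < s_minus)  # the negative area must outweigh the positive one
```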

The phrases handled when the phrase topic level calculation unit 15 calculates topic levels are those whose creation times recorded in the all-phrase storage database 13 fall within either the positive interval or the negative interval of FIG. 4; phrases with older creation times are not used in the analysis. In other words, the topic extraction target period mentioned above runs from the present back to Tplus + Tminus in the past. When the number of input documents is small, however, the topic extraction target period is expanded; the method is described in detail later.

For each phrase, the phrase topic level calculation unit 15 multiplies the appearance frequency during the Tminus period by the negative weight W− and the appearance frequency during the Tplus period by the positive weight W+, adds them, and takes the total over the whole period as the topic level. The topicality of each phrase is thus judged by comparing its amount of appearance in the distant past (negative interval) with that in the near past (positive interval). The number of occurrences within documents could be used as the appearance frequency of a phrase, but in what follows the number of documents containing the phrase is used. That is, a document that contains the phrase even once counts as one appearing document for that phrase. This is because, in news articles for example, proper nouns such as the names of the people or organizations that are the subject often appear only once, so the number of occurrences within a document is not directly tied to the degree of topicality; rather, a phrase is considered highly topical when many documents contain it.
A phrase that is used at a constant frequency throughout the negative and positive intervals receives a small value, while a phrase whose appearances have increased recently (within the positive interval) receives a large positive topic level. To prevent phrases that appear in few documents from being extracted as topics, and to detect larger topic changes, the area S− of the negative interval is made larger than the area S+ of the positive interval.
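The weighted sum described above can be sketched as follows, counting each document at most once per phrase. The record layout (creation time, document ID, phrase) follows the database description; the function name, the weight W− = -0.5, and every document ID other than doc00275, doc00277, and doc00278 are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

Record = Tuple[datetime, str, str]  # (creation time, document ID, phrase)


def topic_levels(records: Iterable[Record], now: datetime,
                 t_plus: timedelta = timedelta(hours=8),
                 t_minus: timedelta = timedelta(hours=24),
                 w_plus: float = 1.0, w_minus: float = -0.5) -> Dict[str, float]:
    """Sum W+ per document in the positive interval and W- per document in
    the negative interval, for every phrase."""
    levels: Dict[str, float] = defaultdict(float)
    seen = set()  # (document ID, phrase) pairs already counted
    for created, doc_id, phrase in records:
        age = now - created
        if age < timedelta(0) or age > t_plus + t_minus:
            continue  # outside the analysis time range
        if (doc_id, phrase) in seen:
            continue  # a document counts once per phrase
        seen.add((doc_id, phrase))
        levels[phrase] += w_plus if age <= t_plus else w_minus
    return dict(levels)


now = datetime(2004, 1, 8, 17, 0)
records = [
    (datetime(2004, 1, 8, 16, 45), "doc00275", "Iraq"),
    (datetime(2004, 1, 8, 16, 45), "doc00275", "Iraq"),   # same document again
    (datetime(2004, 1, 8, 16, 48), "doc00277", "Tokyo"),
    (datetime(2004, 1, 8, 16, 48), "doc00278", "Tokyo"),
    (datetime(2004, 1, 7, 20, 0), "doc00101", "Tokyo"),   # hypothetical older documents
    (datetime(2004, 1, 7, 21, 0), "doc00102", "Tokyo"),
    (datetime(2004, 1, 7, 22, 0), "doc00103", "Tokyo"),
    (datetime(2004, 1, 7, 23, 0), "doc00104", "Tokyo"),
    (datetime(2004, 1, 6, 0, 0), "doc00001", "Iraq"),     # too old, ignored
]
print(topic_levels(records, now))
```

In this made-up data, "Tokyo" appears in both intervals and its contributions cancel, while "Iraq" appears only recently and keeps a positive level.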

When the phrase topic level calculation unit 15 finishes its calculation, the result is recorded in the output phrase recording device 16. Because the output phrase recording device 16 records each phrase and its topic level, the current topics can be found by retrieving as many phrases as needed in descending order of topic level.

Next, the actual operation of the topic extraction apparatus 10 of the present invention is described.

FIG. 5 is a flowchart of the processing of the document analysis unit in an embodiment of the present invention.

Step 101) The document analysis unit 12 checks whether document data is stored in the document data buffer 11. If a document is waiting to be processed, the process proceeds to step 102. If not, the process proceeds to step 106 and repeats this check, waiting for a document to be input, until an end command is given.

Step 102) The document analysis unit 12 acquires the data for one document from the document data buffer 11.

Step 103) Next, the document analysis unit 12 analyzes the acquired document. Document analysis is the process of dividing the input text into part-of-speech units.

Step 104) From the analysis result, it is determined whether the document contains any phrases (nouns or compound nouns) subject to topic extraction. For example, if the text is so short that it contains no nouns or compound nouns at all, or if no noun or compound noun was obtained because of an analysis error, processing of the acquired document ends and the process returns to step 101 to handle the next document.

Step 105) If phrases (nouns or compound nouns) are included, they are recorded in the all-phrase storage database 13 together with the document's creation time, and the process proceeds to step 106.

Step 106) Until an end command is given, the process returns to step 101 and repeats.
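Steps 101 to 106 can be sketched as a loop over a queue. This is a simplified sketch: it returns once the buffer is empty instead of waiting for new input until an end command, and the `analyze` callback stands in for the unspecified morphological analysis. All names here are illustrative assumptions.

```python
from collections import deque
from datetime import datetime
from typing import Callable, Deque, List, Tuple

Document = Tuple[datetime, str, str]  # (creation time, document ID, text)
Record = Tuple[datetime, str, str]    # (creation time, document ID, phrase)


def drain_buffer(buffer: Deque[Document],
                 analyze: Callable[[str], List[str]],
                 database: List[Record]) -> int:
    """Process every waiting document; return how many yielded phrases."""
    stored = 0
    while buffer:                                 # step 101: anything waiting?
        created, doc_id, text = buffer.popleft()  # step 102: take one document
        phrases = analyze(text)                   # step 103: analyze it
        if not phrases:                           # step 104: no phrases found
            continue
        for phrase in phrases:                    # step 105: record the set
            database.append((created, doc_id, phrase))
        stored += 1
    return stored


# Hypothetical input; str.split stands in for real phrase extraction.
buffer = deque([
    (datetime(2004, 1, 8, 16, 45), "doc00275", "Iraq today"),
    (datetime(2004, 1, 8, 16, 46), "doc00276", ""),  # yields no phrases
])
database: List[Record] = []
print(drain_buffer(buffer, str.split, database), len(database))
```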

FIG. 6 is an example of the data recorded in the all-phrase storage database in an embodiment of the present invention. In this example, the phrases "Iraq", "Self-Defense Forces dispatch", "today", and "experience", extracted from the document with document ID "doc00275" created at 16:45 on January 8, 2004, are recorded together with the time information "2004/1/8 16:45" and the document ID "doc00275". In the figure, the phrase "Tokyo" 401 and the phrase "Tokyo" 402 both have the time information "2004/1/8 16:48", which indicates that the phrase "Tokyo" was extracted from multiple documents (doc00277 and doc00278) with the same creation time.
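The rows described for FIG. 6 can be sketched as a list of records. Only the rows named in the text are reproduced; the figure itself may contain more. The `PhraseRecord` class and the helper function are illustrative assumptions, not a storage format given in the source.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Set


@dataclass(frozen=True)
class PhraseRecord:
    created: datetime  # creation time of the source document
    doc_id: str        # ID that uniquely identifies the document
    phrase: str        # one phrase extracted from that document


records = [
    PhraseRecord(datetime(2004, 1, 8, 16, 45), "doc00275", "Iraq"),
    PhraseRecord(datetime(2004, 1, 8, 16, 45), "doc00275", "Self-Defense Forces dispatch"),
    PhraseRecord(datetime(2004, 1, 8, 16, 45), "doc00275", "today"),
    PhraseRecord(datetime(2004, 1, 8, 16, 45), "doc00275", "experience"),
    PhraseRecord(datetime(2004, 1, 8, 16, 48), "doc00277", "Tokyo"),
    PhraseRecord(datetime(2004, 1, 8, 16, 48), "doc00278", "Tokyo"),
]


def documents_containing(rows: Iterable[PhraseRecord], phrase: str) -> Set[str]:
    """Document IDs in which the phrase appears at least once."""
    return {row.doc_id for row in rows if row.phrase == phrase}


print(sorted(documents_containing(records, "Tokyo")))
```

Keeping the document ID in every row is what allows the appearance frequency to be counted per document rather than per occurrence.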

FIG. 7 is a flowchart of the topic extraction processing in an embodiment of the present invention, showing the flow of processing when topics are extracted using the information recorded in the all-phrase storage database 13 described above.

Step 201) When processing starts, the trigger issuing unit 14 commands the phrase topic level calculation unit 15 to start processing.

Step 202) The phrase topic level calculation unit 15 provisionally sets the analysis time range to the period (Tplus + Tminus) extending back from the present. For example, with a filter like that of FIG. 8, where the positive interval is 8 hours and the negative interval is 24 hours, the provisional analysis time range runs from 32 hours ago (the positive and negative intervals combined) to the present.
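The provisional range in step 202 is simple date arithmetic. A minimal sketch, using the 8-hour and 24-hour intervals from the example and the current time used later in the description (the function name is an assumption):

```python
from datetime import datetime, timedelta
from typing import Tuple


def provisional_range(now: datetime,
                      t_plus: timedelta = timedelta(hours=8),
                      t_minus: timedelta = timedelta(hours=24)
                      ) -> Tuple[datetime, datetime]:
    """Return (start, end) of the provisional analysis time range."""
    return now - (t_plus + t_minus), now


start, end = provisional_range(datetime(2004, 1, 8, 17, 0))
print(start)  # 2004-01-07 09:00:00
```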

Step 203) Next, the phrase topic level calculation unit 15 accesses the all-phrase storage database 13 and, by examining the creation time and document ID information, counts the number of documents whose creation times fall within the provisional analysis time range.

Step 204) If few documents have creation times between 32 hours ago and the present, so that the count falls short of a fixed number (for example, 100 documents), the process proceeds to step 205. Otherwise, it proceeds to step 206. When the number of documents in the past 32 hours exceeds the fixed number (for example, 100), the provisional analysis time range of 32 hours becomes the actual analysis time range as it is.

Step 205) If the number of documents with creation times in the past 32 hours is below the fixed number (100), the period is extended further into the past until the number of documents in the period reaches the fixed number (100).

The period from the creation time at which the number of documents reaches the fixed number up to the present becomes the analysis time range. If the number of documents exceeds 100 on going back to a document created 48 hours before the present, the filter becomes one with a total width of 48 hours, with a positive interval of 12 hours and a negative interval of 36 hours, as in the graph of FIG. 9.
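The expansion in steps 204 and 205 can be sketched as follows. The source gives only the example of 8 and 24 hours becoming 12 and 36 hours when the range grows from 32 to 48 hours; this sketch assumes from that example that both intervals are scaled by the same factor. It also assumes that, when fewer than the required number of documents exist at all, the range extends to the oldest one. The function name and parameters are illustrative.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def expand_filter(doc_times: Iterable[datetime], now: datetime,
                  t_plus: timedelta = timedelta(hours=8),
                  t_minus: timedelta = timedelta(hours=24),
                  min_docs: int = 100) -> Tuple[timedelta, timedelta]:
    """Return (Tplus, Tminus), widened until min_docs documents are covered.

    doc_times holds one creation time per document.
    """
    base = t_plus + t_minus
    ages = sorted(now - t for t in doc_times if t <= now)
    if not ages:
        return t_plus, t_minus
    # Age of the min_docs-th newest document (or the oldest one available).
    reach = ages[min(min_docs, len(ages)) - 1]
    span = max(base, reach)   # never shrink below the provisional range
    scale = span / base       # assumed: both intervals scale proportionally
    return t_plus * scale, t_minus * scale


now = datetime(2004, 1, 8, 17, 0)
times = [now - timedelta(hours=h) for h in (1, 10, 48, 60)]
print(expand_filter(times, now, min_docs=3))
```

With a threshold of three documents in this made-up data, the third newest document is 48 hours old, so the filter widens to 12 and 36 hours, matching the proportions of the FIG. 9 example.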

The analysis time range is expanded according to the amount of input documents for the following reason.

The document data used by the topic extraction apparatus of the present invention is text written by people, so fewer documents are created at night. Documents about foreign exchange tend to be created mostly during the daytime on weekdays, and documents such as domestic sports results are also rarely created at night. If the analysis time range were fixed regardless of the day of the week or the time of day, there could be cases in which no topics are extracted at all, so the analysis time range is changed dynamically.

The following description covers the processing when the analysis time range has been expanded and the filter of FIG. 9 is used.

Step 206) It is determined whether any phrases within the analysis time range decided so far have not yet been acquired. If there are, the process proceeds to step 207.

Step 207) The phrase topic level calculation unit 15 accesses the all-phrase storage database 13 and acquires a phrase whose creation time is within the analysis time range, if there is one. If the current time is 17:00 on January 8, 2004, the analysis time range covers 9:00 on January 7, 2004 (32 hours earlier) onward, so in FIG. 6 the phrase "Minister of Finance", which has the creation time "2004/1/8 16:48", for example, is subject to acquisition.

Step 208) The creation time "2004/1/8 16:48" is 12 minutes before the current time and falls within the positive interval of FIG. 9, so the weight is "1". This value "1" is stored temporarily in a buffer in the phrase topic level calculation unit 15 as the provisional topic level of the phrase "Minister of Finance".

The process returns to step 206, and steps 206 to 208 are performed for every phrase whose creation time is within the analysis time range, that is, from 32 hours ago onward. If the phrase "Minister of Finance" also appears at other times, the weight is determined from the creation time in the same way and added to the value held in the buffer in the phrase topic level calculation unit 15.

Step 209) When, in step 206, the provisional topic levels of all phrases have been computed, the provisional value held for each phrase in the buffer in the phrase topic level calculation unit 15 is that phrase's topic level. This value is therefore recorded in the output phrase recording device 16.

Through the processing so far, the topic levels of all phrases that appeared during the analysis time range, the past 48 hours, are calculated and recorded in the output phrase recording device 16 as shown in FIG. 10. Among them, "Iraq", with the large topic level "12.75", and "Self-Defense Forces dispatch", with "10.00", are the words that strongly represent the topics as of the current time, 17:00 on January 8, 2004.
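Reading the current topics out of the recorded results amounts to sorting by topic level. A minimal sketch: the values for "Iraq" and "Self-Defense Forces dispatch" are the ones quoted above, while the other two entries and the function name are made up for illustration.

```python
from typing import Dict, List, Tuple


def top_phrases(levels: Dict[str, float], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k phrases with the largest topic levels, highest first."""
    return sorted(levels.items(), key=lambda item: item[1], reverse=True)[:k]


levels = {
    "Iraq": 12.75,
    "Self-Defense Forces dispatch": 10.00,
    "today": -3.5,        # hypothetical value
    "experience": 0.25,   # hypothetical value
}
print(top_phrases(levels))
```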

Step 210) If there is an end command, processing ends here. If not, the process proceeds to step 211.

Step 211) If there is no end command, the process waits for the fixed period preset in the trigger issuing unit 14, then returns to step 201 and issues the start trigger for topic level calculation again.

By repeating the above processing, the latest topics are output continually at fixed intervals.

The operations shown in FIG. 5 and FIG. 7 can also be built as a program, installed on a computer used as the topic extraction apparatus, and executed by control means such as a CPU, or obtained via a network and executed.

The present invention is not limited to the above embodiment and examples; various modifications and applications are possible within the scope of the claims.

The present invention is applicable to techniques for automatically extracting phrases that have recently become topics from documents.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a diagram showing the principle configuration of the present invention.
FIG. 3 is a configuration diagram of the topic extraction apparatus in an embodiment of the present invention.
FIG. 4 is an example of the filter of the phrase topic level calculation unit in an embodiment of the present invention.
FIG. 5 is a flowchart of the processing of the document analysis unit in an embodiment of the present invention.
FIG. 6 is an example of the all-phrase storage database in an embodiment of the present invention.
FIG. 7 is a flowchart of the topic extraction processing in an embodiment of the present invention.
FIG. 8 is an example of a filter with a positive interval of 8 hours and a negative interval of 24 hours in an embodiment of the present invention.
FIG. 9 is an example of a filter with an expanded analysis time range in an embodiment of the present invention.
FIG. 10 is an example of the phrases and topic levels stored in the output phrase recording device in an embodiment of the present invention.

Explanation of symbols

10 Topic extraction device
11 Document data buffer
12 Document analysis unit
13 All-phrase storage database
14 Trigger issuing unit
15 Phrase topic degree calculation unit
16 Output phrase storage means (output phrase recording device)
151 Phrase acquisition means
152 Phrase topic degree calculation means

Claims

A topic extraction method that analyzes a large number of documents having creation time information and, by examining how the appearance frequency of each phrase in the documents changes over time, quantifies and outputs the strength of the phrase's topicality, the method comprising:
a phrase acquisition step in which phrase acquisition means acquires phrases whose creation time falls within an analysis time range from an all-phrase storage database that records, as a set, the creation time of a document, an ID that uniquely identifies the document, and a phrase that appears in the document;
a phrase topic degree calculation step in which phrase topic degree calculation means takes as the analysis time range the span made up of the interval from the current time back to the time Tplus in the past (the positive interval) and the interval from the time Tplus before the current time back to the time Tminus in the past (the negative interval), sets a positive value for the positive interval and a negative value for the negative interval such that the absolute value of the integral of the negative value over the negative interval is larger than the absolute value of the integral of the positive value over the positive interval, applies the positive value to the appearance frequency of a phrase appearing in the positive interval and stores the result in a buffer, applies the negative value to the appearance frequency of the phrase appearing in the negative interval and stores the result in the buffer, and finally calculates the topic degree of the phrase by summing the positive and negative values stored in the buffer; and
an output step in which output means outputs the calculated topic degree to output phrase storage means,
wherein the phrase topic degree calculation step expands the analysis time range when the number of documents whose creation time falls within the analysis time range is less than a predetermined number.
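
The topic degree calculation recited in the method claim above can be illustrated with a minimal Python sketch. It is not the patented implementation. The function name `topic_degrees`, the constant weights `w_plus` and `w_minus`, and the record layout `(created_at, doc_id, phrase)` are assumptions made for illustration. The sketch also assumes that the negative interval has length Tminus and starts where the positive interval ends; the claim wording could also be read as measuring Tminus from the current time.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def topic_degrees(records, now, t_plus, t_minus, w_plus=1.0, w_minus=-0.5):
    """Return {phrase: topic degree} for records inside the analysis time range.

    records: iterable of (created_at, doc_id, phrase) tuples (assumed layout).
    w_plus / w_minus: constant filter values (assumed; the filter shape is not
    fixed by the claim beyond the sign and integral conditions).
    """
    # Claimed condition: the absolute integral of the negative values over the
    # negative interval must exceed that of the positive values.
    pos_area = w_plus * t_plus.total_seconds()
    neg_area = abs(w_minus) * t_minus.total_seconds()
    if not (w_plus > 0 > w_minus and neg_area > pos_area):
        raise ValueError("filter must satisfy |negative integral| > |positive integral|")

    pos_start = now - t_plus          # older edge of the positive interval
    neg_start = pos_start - t_minus   # older edge of the negative interval (assumed)

    buffer = defaultdict(float)
    for created_at, _doc_id, phrase in records:
        if pos_start < created_at <= now:
            buffer[phrase] += w_plus      # appearance in the positive interval
        elif neg_start <= created_at <= pos_start:
            buffer[phrase] += w_minus     # appearance in the negative interval
    return dict(buffer)
```

With these assumptions, a phrase that appears mostly in the recent (positive) interval receives a positive topic degree, while a phrase that appears at a steady rate across both intervals is pushed below zero because the negative integral dominates.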

A topic extraction device that analyzes a large number of documents having creation time information and, by examining how the appearance frequency of each phrase in the documents changes over time, quantifies and outputs the strength of the phrase's topicality, the device comprising:
an all-phrase storage database that records, as a set, the creation time of a document, an ID that uniquely identifies the document, and a phrase that appears in the document;
phrase acquisition means for acquiring phrases whose creation time falls within an analysis time range from the all-phrase storage database;
phrase topic degree calculation means that takes as the analysis time range the span made up of the interval from the current time back to the time Tplus in the past (the positive interval) and the interval from the time Tplus before the current time back to the time Tminus in the past (the negative interval), sets a positive value for the positive interval and a negative value for the negative interval such that the absolute value of the integral of the negative value over the negative interval is larger than the absolute value of the integral of the positive value over the positive interval, applies the positive value to the appearance frequency of a phrase appearing in the positive interval and stores the result in a buffer, applies the negative value to the appearance frequency of the phrase appearing in the negative interval and stores the result in the buffer, and finally calculates the topic degree of the phrase by summing the positive and negative values stored in the buffer; and
output means for outputting the calculated topic degree to output phrase storage means,
wherein the phrase topic degree calculation means includes means for expanding the analysis time range when the number of documents whose creation time falls within the analysis time range is less than a predetermined number.
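
The range-expansion condition in the device claim above can be sketched as follows. The claim only states that the analysis time range is expanded when too few documents fall inside it, so the way it is expanded here (multiplying both Tplus and Tminus by a common `factor`, up to `max_expansions` times) is an assumption, as are the function names and the `(created_at, doc_id, phrase)` record layout. The negative interval is again assumed to have length Tminus.

```python
from datetime import datetime, timedelta


def count_docs(records, now, t_plus, t_minus):
    """Count distinct documents created inside the analysis time range."""
    start = now - t_plus - t_minus  # assumed: negative interval has length t_minus
    return len({doc_id for created_at, doc_id, _phrase in records
                if start <= created_at <= now})


def expand_analysis_range(records, now, t_plus, t_minus, min_docs,
                          factor=2, max_expansions=5):
    """Widen (t_plus, t_minus) until at least min_docs documents are covered.

    factor and max_expansions are hypothetical tuning parameters.
    """
    for _ in range(max_expansions):
        if count_docs(records, now, t_plus, t_minus) >= min_docs:
            break
        # Scaling both intervals by the same factor keeps the ratio of the
        # negative to the positive integral unchanged for constant filter values.
        t_plus, t_minus = t_plus * factor, t_minus * factor
    return t_plus, t_minus
```

Scaling both intervals together is one simple choice: for constant filter values it preserves the required relation between the positive and negative integrals, so the expanded filter can be used without re-tuning the weights.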

A topic extraction program for causing a computer to function as the means constituting the topic extraction device according to claim 2.