WO2012111226A1

WO2012111226A1 - Time-series document summarization device, time-series document summarization method and computer-readable recording medium

Info

Publication number: WO2012111226A1
Application number: PCT/JP2011/078517
Authority: WO
Inventors: 岡嶋穣; 中澤聡; 河合剛巨
Original assignee: 日本電気株式会社
Priority date: 2011-02-15
Filing date: 2011-12-09
Publication date: 2012-08-23
Also published as: JP5884740B2; US20130311471A1; JPWO2012111226A1

Abstract

This time-series document summarization device (201) outputs a summary sentence of a document-of-interest collection which is a document collection to be summarized. The time-series document summarization device (201) is provided with: a background topic word extraction unit (20) which acquires a set of the document-of-interest collection and a document-of-interest topic word, which is a feature word of the document-of-interest collection, as well as a reference-use document collection, which is a document collection which is different from the document-of-interest collection, in order to extract a background topic word representing a topic which is the background of a topic described in the document-of-interest collection from the reference-use document collection; and a representative character string extraction unit (30) for extracting a representative character string, which includes the document-of-interest topic word and the background topic word, as the summary sentence of the document-of-interest collection from among character strings included in the document-of-interest collection.

Description

Time-series document summarization apparatus, time-series document summarization method, and computer-readable recording medium

The present invention relates to a time-series document summarization apparatus, a time-series document summarization method, and a computer-readable recording medium, and more particularly to a time-series document summarization apparatus, a time-series document summarization method, and a computer-readable recording medium that summarize topics in a document set and present them to a user.

In recent years, due to the development of the Internet, a large number of documents such as news articles and blog articles are generated day and night and are released. Therefore, there is a need for a new technique for summarizing the contents of such a large amount of time series documents.

Trend analysis technology is known as a technology for extracting and summarizing matters that have become hot topics from a large amount of time-series documents. Trend analysis analyzes what is being talked about in each period from a large number of documents generated in time series, such as news articles and blog articles, and presents the result to the user.

In the trend analysis technology, it is common to express the topic of a period of interest by extracting and outputting feature words that appear frequently in the document set belonging to that period.

The technology described in Okumura Manabu, Minano Yasuyuki, Fujiki Yasuaki, Suzuki Yasuhiro, “Text Mining Based on Automatic Collection and Monitoring of Blog Pages”, Japanese Society for Artificial Intelligence, SIG-SW&ONT-A401-01, 2004 (Non-Patent Document 1) extracts feature words that appear more frequently in a specific period by determining whether or not the appearance interval of documents including a certain word is shorter than usual.
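The interval-based idea above can be illustrated with a minimal sketch. It is not the method of Non-Patent Document 1 itself: the comparison of mean intervals, the function names, and the `ratio` threshold are all assumptions made for illustration.

```python
from statistics import mean


def mean_interval(times):
    """Mean gap between consecutive appearance times (None if fewer than two)."""
    ordered = sorted(times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return mean(gaps) if gaps else None


def is_feature_word(all_times, period_start, period_end, ratio=0.5):
    """Return True if documents containing the word arrive at shorter
    intervals than usual inside [period_start, period_end).

    `ratio` is a hypothetical threshold: the in-period mean interval must be
    below `ratio` times the mean interval over all observations.
    """
    usual = mean_interval(all_times)
    in_period = mean_interval(
        [t for t in all_times if period_start <= t < period_end]
    )
    if usual is None or in_period is None:
        return False
    return in_period < ratio * usual
```

Here `all_times` is assumed to hold the posting times (for example, in minutes) of every document that contains the word.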

Furthermore, once a feature word of the target period has been extracted using the technique described in Non-Patent Document 1, it is easy to extract a sentence including that feature word. Such a sentence can be output as a summary sentence representing the topic of that period.

One example is the service described in "Yahoo! Blog Search", [online], [searched August 23, 2010], Internet <URL: http://blog-search.yahoo.co.jp/> (Non-Patent Document 2). In this service, feature words at the current time are displayed on the top page, and when a displayed feature word is clicked, a transition is made to a search page and part of a sentence including the clicked feature word is displayed. This is equivalent to presenting to the user a sentence including a feature word of the period of interest as a sentence explaining the topic of that period.

In addition, the technology described on pages 22 to 23 of Manabu Okumura, Eizo Namba, “Science of Science Text Automatic Summarization”, Ohmsha, 2005 (Non-Patent Document 3) creates a summary by extracting sentences that include feature words of a document. By applying this technique to a set of documents belonging to a certain period, it is possible to present a summary sentence that explains the topic of that period.

As described above, there is a technique for extracting a sentence including a characteristic word of a certain period and presenting it as a summary sentence explaining the topic of the period.

Further, as an example of a technique for processing topic words, Japanese Patent Laid-Open No. 2006-139718 (Patent Document 1) discloses the following technique. That is, when a topic word and document information related to the topic word are read, the degree of document sharing between the documents related to a certain topic word and the documents related to another topic word is calculated according to the topic word combination rule stored in the topic word combination storage unit. Next, topic words that can be combined are selected based on the degree of document sharing, and the selected topic words are combined to form a topic word group together with the degree of document sharing. Next, based on the representative word extraction rule, the representative words of the combined topic word groups are extracted.

Further, Japanese Patent Laid-Open No. 2007-140602 (Patent Document 2) discloses the following technique. That is, for each word included in the processing target document, two distributions are contrasted. The first is the relevance distribution with the users of the word, obtained by acquiring from the relevance database, and totaling, the degrees of relevance between the transmission source of the processing target document and the transmission sources that have used the word. The second is the relevance distribution with other transmission sources, obtained by acquiring from the relevance database, and totaling, the degrees of relevance between the transmission source of the processing target document and other transmission sources. Then, an amount representing the degree to which the word is used by many transmission sources having a high degree of relevance to the transmission source of the processing target document is set as the topic level of the word.

Further, Japanese Patent Laid-Open No. 2008-152634 (Patent Document 3) discloses the following technique. That is, the time series frequency vector of each word is generated by counting the temporal change in appearance frequency of words appearing in a plurality of document sets. The time-series frequency vector of the generated word is analyzed, and a word whose frequency increases rapidly is extracted as a candidate word that is a candidate for a potential topic. A main topic time-series frequency vector is generated by quantifying the number of documents acquired every time for topics whose number of documents exceeds a predetermined threshold among topics included in the document set. Then, the inter-vector distance between the time series frequency vector of each candidate word and the main topic time series frequency vector is calculated, and a word having a large distance is extracted as a latent topic word.

Patent Document 1: JP 2006-139718 A
Patent Document 2: JP 2007-140602 A
Patent Document 3: JP 2008-152634 A

Meanwhile, a new kind of service called microblogging, such as Twitter, has begun to spread. In such a microblog, a user often posts sentences that assume a specific, small number of readers who share background information.

Therefore, compared to conventional news articles and blog articles, the part that explains the background is often omitted, as in conversation between close friends.

When a conventional technology is used that selects sentences containing feature words as summary sentences based on the statistical appearance tendency of words or expressions, sentences that do not include a part explaining the background are, probabilistically, likely to be selected as summary sentences. However, for general readers who do not know the original background, such a sentence is not appropriate as a summary sentence, because they cannot understand what it is written about.

Further, Non-Patent Documents 1 to 3 and Patent Documents 1 to 3 do not disclose a configuration for solving such a problem.

The present invention has been made to solve the above-described problems, and an object of the present invention is to provide a time-series document summarization apparatus, a time-series document summarization method, and a computer-readable recording medium that can output an appropriate summary sentence from a set of documents.

In order to solve the above-described problem, a time-series document summarization apparatus according to an aspect of the present invention is a time-series document summarization apparatus for outputting a summary sentence of a target document set, which is the document set to be summarized. The apparatus includes: a background topic word extraction unit that acquires a pair of the target document set and a target document topic word, which is a feature word of the target document set, as well as a reference document set, which is a document set different from the target document set, and extracts, from the reference document set, a background topic word representing a topic that is the background of a topic described in the target document set; and a representative character string extraction unit that extracts, from among the character strings included in the target document set, a representative character string including the target document topic word and the background topic word as the summary sentence of the target document set.

In order to solve the above-described problem, a time-series document summarization method according to an aspect of the present invention is a time-series document summarization method for outputting a summary sentence of a target document set, which is the document set to be summarized. The method includes: a step of acquiring a pair of the target document set and a target document topic word, which is a feature word of the target document set, as well as a reference document set, which is a document set different from the target document set, and extracting, from the reference document set, a background topic word representing a topic that is the background of a topic described in the target document set; and a step of extracting, from among the character strings included in the target document set, a representative character string including the target document topic word and the background topic word as the summary sentence of the target document set.

In order to solve the above problems, a computer-readable recording medium according to an aspect of the present invention is a computer-readable recording medium on which a time-series document summarization program is recorded, the program being used in a time-series document summarization apparatus for outputting a summary sentence of a target document set, which is the document set to be summarized. The time-series document summarization program causes a computer to execute: a step of acquiring a pair of the target document set and a target document topic word, which is a feature word of the target document set, as well as a reference document set, which is a document set different from the target document set, and extracting, from the reference document set, a background topic word representing a topic that is the background of a topic described in the target document set; and a step of extracting, from among the character strings included in the target document set, a representative character string including the target document topic word and the background topic word as the summary sentence of the target document set.

According to the present invention, an appropriate summary sentence can be output from a set of documents.

FIG. 1 is a diagram showing an example of the topics of a day in a microblog.
FIG. 2 is a diagram showing the feature words of each period, and sentences including those feature words, for the example of FIG. 1.
FIG. 3 is a schematic configuration diagram of a time-series document summarization apparatus according to an embodiment of the present invention.
FIG. 4 is a block diagram showing the control structure provided by the time-series document summarization apparatus according to the first embodiment of the present invention.
FIG. 5 is a flowchart showing the operation procedure when the time-series document summarization apparatus according to the embodiment of the present invention performs time-series document summarization processing.
FIG. 6 is a diagram showing an example of the data output by the target document topic word extraction unit 10.
FIG. 7 is a diagram showing an example of the data output by the background topic word extraction unit 20.
FIG. 8 is a diagram showing an example of the summary scores of character strings in the representative character string extraction unit.
FIG. 9 is a diagram showing an example of the data output by the representative character string extraction unit 30.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings, the same or corresponding parts are denoted by the same reference numerals and description thereof will not be repeated.

First, in order to facilitate understanding of the present invention, problems to be solved by the present invention will be described in detail.

Human sentences are considered to consist of two parts: a part explaining the “background”, which indicates what the sentence is about, and a part explaining the “new information” that the writer wants to convey through the sentence. This holds not only for written text but also for verbal utterances.

Here, the “background” refers to the pre-requisite topics and the objects to be described that are necessary for understanding the text.

On the other hand, “new information” refers to matters that the author wants to assert through the text, such as descriptions of new facts, opinions and impressions, etc., regarding the topic and subject matter explained as background.

Note that the term “new information” is used generically here. It refers to information that the author wants to convey to the reader or a point that the author wants to assert, and it need not be limited to information that is completely unknown to the reader.

That is, even if the part of the text that the writer wants to convey to the reader is a reconfirmation of a fact the reader may already know, that part is broadly included in the new information. Moreover, it need not be an explanation of a fact; it may be an opinion or impression of the writer.

Suppose, for example, that a news article published the day after the Japan vs. Denmark match in the soccer World Cup says “Japan won the Japan vs. Denmark match in the soccer World Cup 3-1”. Here, the part “the Japan vs. Denmark match in the soccer World Cup” explains the background, that is, what the text is written about, and the part “Japan won 3-1” describes the new information that the author wants to convey through the sentence.

The main part that the writer wants to convey through the text is the explanation of the new information. Since the description of the background is not new information, it can be omitted when the information is transmitted to a specific partner who already shares the background information.

On the other hand, when transmitting information to a large number of unspecified parties who do not necessarily share background information, it is necessary to explain not only new information but also the background that is the premise.

For example, a news article assumes an unspecified number of readers who do not necessarily share background information, so it describes the new information after explaining the background, as in “Japan won 3-1 in the Japan vs. Denmark match in the soccer World Cup.”

On the other hand, when close friends are talking on the day after the match, it is natural to remark on Japan's result with just “3 to 1!”, without explaining the background. This rests on the expectation that, because it is the day after the match, it is obvious what is being talked about without any explanation, and the listener will understand even if the background is omitted.

In this way, the more public a text (utterance) is, that is, the more it is addressed to an unspecified large audience, the more detailed its background explanation tends to be; and the more private a text (utterance) is, that is, the more it is addressed to a specific few, the more its background explanation tends to be omitted.

Conventional trend analysis technology has targeted news articles and blog articles. Sentences contained in these documents are open to the public and are intended to be read by an unspecified number of people. Explanations of background topics are therefore often included in the document so that the content the writer wants to convey can be understood even by an unspecified number of readers.

For this reason, when news articles and blog articles were the target of analysis, as in the past, it was possible to output a summary that included explanations of the background topics and was appropriate for a large number of unspecified readers simply by extracting sentences including many feature words from the documents to be summarized, using the techniques listed in Non-Patent Documents 1 to 3.

On the other hand, a new type of service called microblogging has become very popular in recent years. Twitter is a typical example. Microblogging is a service that allows individuals to post their own texts, just like blogs. The user can post a short sentence of about 140 characters at the maximum. With microblogging, people can easily post what they thought of on the Internet in real time.

In these microblogs, there are many posts that are assumed to be read only by specific people, called followers, who have registered to read the user's text, and usage that is close to private daily conversation is widespread. Except for some exceptions, the number of followers per user is around tens to hundreds, so users can post text that assumes a specific, small number of readers sharing background information.

Because of these characteristics, when a large number of sentences posted on microblogs are accumulated, they are thought to contain far more sentences intended for a specific, small number of readers than an accumulation of conventional news articles and blogs. In such sentences, the part that would explain the background is often omitted, as in conversation between close friends.

It is difficult to output an appropriate summary sentence by accumulating many sentences posted on such microblogs and simply extracting sentences including feature words using conventional techniques.

The reason is as follows. In microblogs, there are very many sentences written for a specific, small number of readers, and most sentences in microblogs do not explain the background topic. Therefore, when a sentence including a feature word is selected as a summary sentence based on the statistical appearance tendency of words or expressions, a sentence that does not include a part explaining the background is, probabilistically, likely to be selected.

However, when such sentences are presented as a summary of the original document set to the majority of readers, who do not know the original background, those readers cannot understand what the sentences were written about even after reading them. Such sentences are inappropriate as summary sentences.

Suppose, for example, that the Japan vs. Denmark match of the soccer World Cup is being broadcast on television, and that the second goal has just been scored. In this case, “a shot was taken” and “a goal was scored” are new information at the current time. On the other hand, “Soccer World Cup”, “Japan vs. Denmark”, and the like are background topics that specify what the “shoot” and “goal” refer to.

At this time, many sentences are posted to the microblog that convey only the current new information and omit the explanation of the background, such as “Oh, I decided to shoot” and “I did it, goal”. The contributors of these texts are posting to a small number of readers who share the background and can guess what is being written about. In many cases, they also assume that the time at which the posted text is read does not differ greatly from the time of posting.

On the other hand, sentences containing explanations of background topics, such as "Japan vs Denmark in the soccer World Cup have just reached their second goal.", are few in number when the microblog is viewed as a whole. This is because such explanatory text is used in public media and not in private text or conversation.

For these reasons, in microblogs, frequent words such as “shoot” and “goal” are readily extracted as feature words at that time, whereas words indicating the background topic, such as “Soccer World Cup”, “Japan”, and “Denmark”, appear less frequently and are difficult to extract as feature words.

As a result, if sentences containing many feature words of a certain period are simply extracted from a microblog, sentences such as “Shoot decided” and “Goal, happy.”, which contain only feature words representing new information and no words representing the background topic, tend to be extracted as summary sentences. Such a summary, composed only of new information, is difficult to understand for third-party readers who do not know the background topic, and is not suitable as a summary.

As described above, it is not possible to output an appropriate summary sentence that can be easily understood by an unspecified number of general readers from a microblog by simply extracting a sentence including a feature word using conventional technology.
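The failure mode described above can be reproduced with a small sketch of the conventional approach. The scoring rule (count of distinct feature words found by case-insensitive substring matching), the function names, and the sample posts are assumptions made for illustration only.

```python
def count_feature_words(sentence, feature_words):
    """Number of distinct feature words that occur in the sentence."""
    text = sentence.lower()
    return sum(1 for word in feature_words if word in text)


def baseline_summary(sentences, feature_words):
    """Conventional selection: the sentence containing the most feature words."""
    return max(sentences, key=lambda s: count_feature_words(s, feature_words))


# Hypothetical posts: only the last one explains the background topic.
posts = [
    "I did it, goal",
    "Goal, happy. What a shoot",
    "Japan vs Denmark in the soccer World Cup: the second goal",
]
selected = baseline_summary(posts, ["shoot", "goal"])
```

With these inputs the second post is selected, because it contains both feature words, even though only the third post tells an outside reader which match is being discussed.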

A specific example of this problem will now be described using FIG. 1 and FIG. 2.

FIG. 1 is a diagram showing an example of the topics of a day in a microblog. FIG. 2 is a diagram illustrating the feature words of each period and sentences including those feature words for the example of FIG. 1.

FIGS. 1 and 2 illustrate changes in topics in a set of documents posted during a day on a microblog. One day is divided into six periods every four hours, and for each period, one sentence summarizing topics included in documents posted in that period is output. Therefore, it is assumed that a total of six summary sentences are output per day.
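The division of a day into six four-hour periods can be sketched as follows. The function names and the label format are assumptions; the source specifies only the four-hour granularity.

```python
from collections import defaultdict
from datetime import datetime

PERIOD_HOURS = 4  # six periods per day, as in the example of FIGS. 1 and 2


def period_label(timestamp):
    """Label of the four-hour period that contains the timestamp."""
    start = (timestamp.hour // PERIOD_HOURS) * PERIOD_HOURS
    return f"{start:02d}:00-{start + PERIOD_HOURS:02d}:00"


def group_by_period(posts):
    """Group (datetime, text) pairs by four-hour period label."""
    groups = defaultdict(list)
    for timestamp, text in posts:
        groups[period_label(timestamp)].append(text)
    return dict(groups)
```

One summary sentence would then be produced for each key of the returned dictionary, giving at most six per day.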

FIG. 1 shows the result of a human worker reading and analyzing the posted documents and examining what became a topic. This was a day on which various parts of Japan were hit by heavy rain, and it can be seen that three periods, “4:00-8:00”, “12:00-16:00”, and “16:00-20:00”, were filled with topics related to the heavy rain.

Since the topics of “12:00-16:00” and “16:00-20:00” follow on from the heavy rain topic of the earlier “4:00-8:00” period, it is desirable, when summarizing the “12:00-16:00” and “16:00-20:00” periods, to output summary sentences that include an explanation of the background topic.

FIG. 2 shows the result of extracting, for the same document set as FIG. 1, the feature words of each period and a sentence including those feature words. The sentences shown in FIG. 2 do not include an explanation of heavy rain, the topic that forms the background.

In other words, the extracted sentences are "Today is a heavy rain warning because of heavy rain", "Train has stopped", and "Kinkakuji is supposed to be dangerous", and every one of them certainly includes a feature word of its period. However, merely reading these extracted sentences does not reveal that the three events share the common background of heavy rain.

This method cannot output a summary sentence that includes an explanation of the background topic because, when generating a summary sentence for each period, it considers only whether the sentence includes the feature words of the period of interest. It is therefore necessary to add a further condition so that the summary sentence includes an explanation of the background topic.

Based on the above idea, the time-series document summarization apparatus according to the embodiment of the present invention uses, as a clue, the feature words of periods earlier than the period of interest. As a result, from a large number of documents having time information, it can output a summary sentence that summarizes the topic of a certain period and also includes an explanation of the topic that forms its background.

The time-series document summarization apparatus 201 according to the embodiment of the present invention typically has a computer with a general-purpose architecture as its basic structure, and provides the various functions described later by executing a preinstalled program. Generally, such a program is distributed in a state of being stored in a recording medium such as a flexible disk or a CD-ROM (Compact Disk Read Only Memory), or via a network or the like. When such a general-purpose computer is used, an OS (Operating System) for providing the basic functions of the computer may be installed in addition to the application for providing the functions according to the embodiment of the present invention. In this case, the program according to the embodiment of the present invention may execute processing by calling necessary modules, out of the program modules provided as part of the OS, in a predetermined order and/or at predetermined timing. That is, the program itself according to the embodiment of the present invention need not include such modules, and the processing may be executed in cooperation with the OS. Therefore, the program according to the embodiment of the present invention may take a form that does not include the above-described modules.

Furthermore, the program according to the embodiment of the present invention may be provided by being incorporated in a part of another program such as an OS. Even in this case, the program itself according to the embodiment of the present invention does not include a module included in the other program as described above, and the process is executed in cooperation with the other program. That is, the program according to the embodiment of the present invention may be in a form incorporated in such another program.

Alternatively, some or all of the functions provided by program execution may be implemented as a dedicated hardware circuit.

[Device configuration]
FIG. 3 is a schematic configuration diagram of the time-series document summarizing apparatus according to the embodiment of the present invention.

Referring to FIG. 3, the time-series document summarization apparatus 201 is an information processing apparatus such as a portable information terminal, a personal computer, or a server, and includes a CPU (Central Processing Unit) 101 that is an arithmetic processing unit, a main memory 102, a hard disk 103, an input interface 104, a display controller 105, a data reader/writer 106, and a communication interface 107. These units are connected to each other via a bus 121 so that data communication is possible.

The CPU 101 performs various operations by loading programs (code) stored in the hard disk 103 into the main memory 102 and executing them in a predetermined order. The main memory 102 is typically a volatile storage device such as a DRAM (Dynamic Random Access Memory), and stores data indicating various arithmetic processing results in addition to the programs read from the hard disk 103. The hard disk 103 is a non-volatile magnetic storage device, and stores various setting values in addition to the programs executed by the CPU 101. The program installed in the hard disk 103 is distributed in a state of being stored in the recording medium 111, as will be described later. In addition to the hard disk 103, or instead of the hard disk 103, a semiconductor storage device such as a flash memory may be employed.

The input interface 104 mediates data transmission between the CPU 101 and an input unit such as a keyboard 108, a mouse 109, and a touch panel (not shown). That is, the input interface 104 receives an external input such as an operation command given by the user operating the input unit.

The display controller 105 is connected to a display 110 that is a typical example of a display unit, and controls display on the display 110. That is, the display controller 105 displays the result of image processing by the CPU 101 to the user. The display 110 is, for example, an LCD (Liquid Crystal Display) or a CRT (Cathode Ray Tube).

The data reader/writer 106 mediates data transmission between the CPU 101 and the recording medium 111. That is, the recording medium 111 is distributed in a state where a program to be executed by the time-series document summarization apparatus 201 is stored on it, and the data reader/writer 106 reads the program from the recording medium 111. Further, the data reader/writer 106 writes processing results of the time-series document summarization apparatus 201 to the recording medium 111 in response to an internal command of the CPU 101. The recording medium 111 is, for example, a general-purpose semiconductor storage device such as CF (Compact Flash) or SD (Secure Digital), a magnetic storage medium such as a flexible disk, or an optical storage medium such as a CD-ROM (Compact Disk Read Only Memory).

The communication interface 107 mediates data transmission between the CPU 101 and a personal computer, a server device, or the like. The communication interface 107 typically has an Ethernet (registered trademark) or USB (Universal Serial Bus) communication function. Instead of installing the program stored in the recording medium 111 in the time-series document summarization apparatus 201, a program downloaded from a distribution server or the like via the communication interface 107 may be installed in the time-series document summarization apparatus 201.

Also, the time series document summarization apparatus 201 may be connected to another output device such as a printer as necessary.

[Control structure]
Next, a control structure for providing various functions in the time-series document summarizing apparatus 201 will be described.

FIG. 4 is a block diagram showing a control structure provided by the time-series document summarizing apparatus according to the first embodiment of the present invention.

The control structure shown in FIG. 4 is provided by loading a program (code) stored in the hard disk 103 into the main memory 102 and causing the CPU 101 to execute it. Note that some or all of the modules shown in FIG. 4 may be provided by firmware implemented in hardware. Alternatively, part or all of the control structure shown in FIG. 4 may be realized by dedicated hardware and/or a wiring circuit.

Referring to FIG. 4, the time-series document summarizing apparatus 201 includes a target document topic word extraction unit 10, a background topic word extraction unit 20, and a representative character string extraction unit 30 as its control structure.

The time-series document summarizing apparatus 201 accepts a document set with time information as an input. A document set with time information is a set of documents in which documents included in the set are associated with some time. The time associated with each document represents the time when the document was created, the time when it was transmitted, and the like. The time may be described in any granularity such as year, month, day, hour, minute, and second.
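A document set with time information, as described above, might be represented as follows. This is only a sketch: the class name, field names, and the half-open interval convention are assumptions, not part of the described apparatus.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimedDocument:
    """One document together with its associated time."""
    time: datetime  # creation time, transmission time, or similar
    text: str


def documents_in_period(documents, start, end):
    """Documents whose time falls in the half-open interval [start, end)."""
    return [doc for doc in documents if start <= doc.time < end]
```

Because `datetime` carries year through second, the same structure accommodates whatever granularity the input provides; unused fields simply default to zero.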

Examples of document sets with time information received as input by the time-series document summarization apparatus 201 include news articles, blogs, microblogs, and documents posted on electronic bulletin boards.

The time-series document summarization apparatus 201 summarizes the topics of the input document set. This input document set is called a target document set. That is, the time-series document summarization apparatus 201 creates a summary sentence of the target document set, which is the document set to be summarized.

In the time-series document summarizing apparatus 201, the target document topic word extraction unit 10 sets the input document set with time information as the target document set. Then, the document-of-interest topic word extraction unit 10 extracts a feature word representing the topic of the document-of-interest collection as a document-of-interest topic word and outputs it.

The background topic word extraction unit 20 sets a document set different from the target document set as a reference document set. This reference document set is distinct from, for example, a document set that serves as a dictionary, such as a term dictionary. The reference document set may be a document set with time information or a document set without time information.

The background topic word extraction unit 20 extracts, from the reference document set, feature words representing topics in a period earlier than the period of the target document set as background topic words. Then, the background topic word extraction unit 20 calculates a relevance level representing the relevance between each extracted background topic word and each target document topic word output from the target document topic word extraction unit 10, and outputs the calculated relevance levels together with the background topic words.

The representative character string extraction unit 30 extracts a representative character string representing the topic of the target document set by using, in addition to the target document topic words extracted by the target document topic word extraction unit 10, the background topic words extracted by the background topic word extraction unit 20 and the calculated relevance levels.

[Operation]
Next, the operation of the time-series document summarizing apparatus according to the embodiment of the present invention will be described with reference to the drawings. In the embodiment of the present invention, the time series document summarization method according to the embodiment of the present invention is implemented by operating the time series document summarization apparatus 201. Therefore, the description of the time-series document summarization method according to the embodiment of the present invention is replaced with the following description of the operation of the time-series document summarization apparatus 201. In the following description, FIG. 4 will be referred to as appropriate.

In the time-series document summarizing apparatus 201, the document-of-interest topic word extraction unit 10 acquires the document-of-interest collection, and extracts a word representing the topic of the document-of-interest included in the document-of-interest collection as a document-of-interest topic word.

The background topic word extraction unit 20 acquires the target document set, the set of target document topic words, which are the feature words of the target document set extracted by the target document topic word extraction unit 10, and a reference document set, which is a document set different from the target document set. For example, the background topic word extraction unit 20 acquires, as the reference document set, a document set including documents created or published earlier than the target document set.

Then, the background topic word extraction unit 20 extracts a background topic word representing a topic that is a background of the topic described in the target document set from the reference document set. For example, the background topic word extraction unit 20 extracts, as background topic words, words that appear many times in the reference document set or words whose appearance in it is biased.

The representative character string extraction unit 30 extracts a representative character string including the target document topic word and the background topic word from the character strings included in the target document set as a summary sentence of the target document set.

More specifically, the background topic word extraction unit 20 calculates the degree of association between the target document topic word and the background topic word. For example, the background topic word extraction unit 20 calculates the degree of association based on the co-occurrence of the target document topic word and the background topic word within documents, or on the similarity of their co-occurring words, in at least one of the target document set and the reference document set.

The representative character string extraction unit 30 calculates the score of each character string included in the target document set based on the relevance calculated by the background topic word extraction unit 20, and sets a character string having a high score as the representative character string.

FIG. 5 is a flowchart showing an operation procedure when the time-series document summarization apparatus according to the embodiment of the present invention performs time-series document summarization processing.

Referring to FIG. 5, first, the document-of-interest topic word extraction unit 10 receives an input of a document set with time information from the user (step S1).

Next, the target document topic word extraction unit 10 sets the input document set with time information as the target document set. Then, the document-of-interest topic word extraction unit 10 extracts a feature word representing the topic of the document-of-interest collection as a document-of-interest topic word and outputs it (step S2).

Next, the background topic word extraction unit 20 sets a document set different from the target document set as a reference document set. The background topic word extraction unit 20 extracts, from the reference document set, feature words representing topics in a period before the target document set period as background topic words. Then, the background topic word extraction unit 20 calculates a relevance level representing the relevance between each target document topic word output from the target document topic word extraction unit 10 and each background topic word, and outputs the relevance levels together with the background topic words (step S3).

Next, the representative character string extraction unit 30 extracts a representative character string representing the topic of the target document set by using, in addition to the target document topic words extracted by the target document topic word extraction unit 10, the background topic words extracted by the background topic word extraction unit 20 and the calculated relevance levels (step S4).

Here, the operation of step S1 will be specifically described. In the present embodiment, the user inputs a document set with time information to the target document topic word extraction unit 10 using the keyboard 108 or the like.

Note that the user may input the document set with time information to the target document topic word extraction unit 10 from an external computer connected to the time-series document summarizing apparatus 201 via the communication interface 107 and the network. Alternatively, the user may input a document set with time information by designating a data file storing the document set with time information. In this case, the target document topic word extraction unit 10 reads the document set with time information from the data file designated by the user.

Next, the operation of step S2 will be specifically described. In the present embodiment, the document-of-interest topic word extraction unit 10 sets the input document set with time information as the document-of-interest collection. Then, the document-of-interest topic word extraction unit 10 extracts a feature word representing the topic of the document-of-interest collection as a document-of-interest topic word and outputs it.

Here, there are various methods for extracting feature words representing topics of the target document set. For example, for each word, the number of occurrences in the documents of that period is counted, and the words are ranked in descending order of that count. Then, the top N words can be regarded as feature words that appear in a biased manner in that period.
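The counting-and-ranking method just described can be sketched as follows. This is only an illustration, not the implementation of the embodiment: the function name `top_n_feature_words` and the assumption that each document has already been tokenized into a list of words are introduced here for the example.

```python
from collections import Counter


def top_n_feature_words(documents, n):
    """Return the n most frequent words across the given documents.

    Assumption (not from the source): each document is already
    tokenized into a list of words.
    """
    counts = Counter()
    for words in documents:
        counts.update(words)
    # most_common() sorts by descending count.
    return [word for word, _ in counts.most_common(n)]


# Hypothetical target document set for one period.
docs = [
    ["kinkakuji", "flooded"],
    ["kinkakuji", "dangerous"],
    ["kinkakuji", "flooded", "rain"],
]
print(top_n_feature_words(docs, 2))
```

Raw frequency is the simplest possible criterion; a practical system would likely also filter out function words, which the source does not discuss.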

In addition, various known feature word extraction techniques can be used as a method for extracting feature words representing topics of a target document set. For example, a feature word of a document may be extracted using the technique described in pages 22 to 23 of Non-Patent Document 3.

FIG. 6 is a diagram illustrating an example of data output from the document-of-interest topic word extraction unit 10.

Referring to FIG. 6, in this example, a set of documents posted on a microblog from 16:00 to 20:00 is used as a target document set, and topic words included in this target document set are extracted.

Next, the operation of step S3 will be specifically described. The background topic word extraction unit 20 sets a document set different from the target document set as a reference document set. The background topic word extraction unit 20 extracts, from the reference document set, feature words representing topics in a period before the target document set period as background topic words. Then, the background topic word extraction unit 20 calculates a relevance level representing the relevance between each target document topic word output from the target document topic word extraction unit 10 and each background topic word, and outputs the relevance levels together with the background topic words.

Here, as the reference document set, a set of documents that is expected to include topics earlier than the topic of the target document set is used. As such a set, a set of documents created or published earlier than the target document set can be used.

For example, it is assumed that the input document set of interest is a set of documents posted from 16:00 to 20:00 on a microblog. At this time, as a reference document set, for example, a set of documents posted on the same microblog between 0 o'clock and 16 o'clock can be used.

Alternatively, a document source different from the microblog to which the target document set belongs, such as news articles or another blog, may be used. However, even when another document source is used, it must be a document set that is expected to include topics earlier than the time to which the target document set belongs.

In addition, as long as the reference document set is a set of documents that is expected to include topics earlier than the topic of the target document set, the time when the reference document set was created or published may be far from the creation or publication time of the target document set, or may overlap with it. For example, in the above-described example, as a reference document set, a set of documents posted from 0 o'clock to 6 o'clock may be used, or a set of documents posted from 3 o'clock to 18 o'clock may be used.

The background topic word extraction unit 20 extracts feature words representing topics in a period before the target document set period as background topic words from the reference document set. As the background topic word extraction method, the same method as the target document topic word extraction unit 10 extracting the target document topic word from the target document set may be used, or a different method may be used.

In the simplest case, the same method as that in which the target document topic word extraction unit 10 extracts the target document topic word from the target document set is applied to the reference document set. As a result, a feature word representing a topic in a period earlier than the period of the target document set can be extracted as a background topic word.

Further, the reference document set may be divided into several periods, and the same method as that by which the target document topic word extraction unit 10 extracts the target document topic words from the target document set may be applied to each divided document set.

For example, when a set of documents posted between 0 o'clock and 16 o'clock is used as the reference document set, it may be divided into documents posted during the four periods of “0 o'clock to 4 o'clock”, “4 o'clock to 8 o'clock”, “8 o'clock to 12 o'clock”, and “12 o'clock to 16 o'clock”, and the feature words of each document set may be extracted as background topic words.
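As an illustrative sketch of this period-wise extraction, the following groups documents into fixed-width periods and takes the most frequent words of each. The function name `feature_words_by_period`, the representation of a document as an `(hour, words)` pair with an integer posting hour, and the sample data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def feature_words_by_period(documents, period_hours, n):
    """Group (hour, words) documents into fixed-width periods and
    return the top-n most frequent words of each period.

    Assumption (not from the source): the posting time is given as an
    integer hour of the day and each document is pre-tokenized.
    """
    counts = defaultdict(Counter)
    for hour, words in documents:
        start = (hour // period_hours) * period_hours
        counts[(start, start + period_hours)].update(words)
    return {
        period: [w for w, _ in counter.most_common(n)]
        for period, counter in sorted(counts.items())
    }


# Hypothetical reference document set posted between 0 and 16 o'clock.
reference_docs = [
    (1, ["ebook", "release"]),
    (2, ["ebook"]),
    (9, ["rain", "forecast"]),
    (13, ["rain", "rain", "kyoto"]),
]
result = feature_words_by_period(reference_docs, 4, 1)
print(result)
```

Periods that contain no documents simply do not appear in the result.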

After the background topic words are extracted as described above, the background topic word extraction unit 20 calculates the relevance level representing the relationship between each target document topic word output by the target document topic word extraction unit 10 and each background topic word.

As the degree of association representing the relationship between the target document topic word and the background topic word, various things can be considered. Below, an example of a value considered as a relevance degree representing a relevance between A and B will be described, where the target document topic word and the background topic word are A and B, respectively.

The strength of co-occurrence in which two words appear in the document may be used as the degree of association representing the relation between the target document topic word and the background topic word.

For example, let N1 be the number of documents in the document set in which both word A and word B appear, and N2 be the number of documents in which at least one of word A and word B appears. Then, N1 / N2 can be used as a degree of relevance representing the relevance between the two words. The larger the value, the more strongly the two words co-occur. As a method for counting the number of documents, only the documents in the target document set may be counted, or the documents in the target document set and the reference document set may be counted together. Although the accuracy is inferior to these, only the documents in the reference document set may be counted.
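The N1 / N2 ratio can be computed as in the following sketch. The function name `cooccurrence_relevance`, the representation of each document as a set of words, and the choice to return 0.0 when neither word appears are assumptions for illustration.

```python
def cooccurrence_relevance(documents, word_a, word_b):
    """Return N1 / N2 for two words.

    N1: number of documents containing both words.
    N2: number of documents containing at least one of them.
    Assumption (not from the source): 0.0 is returned when N2 is 0.
    """
    n1 = sum(1 for words in documents if word_a in words and word_b in words)
    n2 = sum(1 for words in documents if word_a in words or word_b in words)
    return n1 / n2 if n2 else 0.0


# Hypothetical documents, each represented as a set of words.
docs = [
    {"kinkakuji", "rain"},
    {"kinkakuji"},
    {"rain"},
    {"ebook"},
]
print(cooccurrence_relevance(docs, "kinkakuji", "rain"))
```

Which document set is passed in (target only, reference only, or both combined) corresponds to the counting options mentioned in the text.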

Also, as the degree of association representing the relationship between the target document topic word and the background topic word, the similarity between the co-occurring words of the target document topic word and the co-occurring words of the background topic word, specifically, the similarity between the context in which the target document topic word appears and the context in which the background topic word appears, may be used.

That is, for a word A and a word B, a vector having a length Nw representing each context can be considered, where Nw is the total number of all words. Each element of the vector represents the number of times that a word co-occurs with the word A or the word B. At this time, by calculating the cosine similarity between the vector representing the context of the word A and the vector representing the context of the word B, the similarity between the contexts of the words A and B can be obtained. This similarity may be used as a degree of relation representing the relation between two words.
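The context-vector comparison can be sketched as below. For brevity the vectors are stored sparsely as word-to-count mappings rather than as dense arrays of length Nw; the cosine value is the same. The function names, the use of the whole document as the co-occurrence window, and the sample data are assumptions for illustration.

```python
import math
from collections import Counter


def context_vector(documents, target):
    """Count how often each other word co-occurs with `target`.

    Assumption (not from the source): two words co-occur when they
    appear in the same document.
    """
    vec = Counter()
    for words in documents:
        if target in words:
            vec.update(w for w in words if w != target)
    return vec


def cosine_similarity(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical pre-tokenized documents.
docs = [
    ["kinkakuji", "flooded", "kyoto"],
    ["rain", "flooded", "kyoto"],
    ["ebook", "release"],
]
a = context_vector(docs, "kinkakuji")
b = context_vector(docs, "rain")
print(cosine_similarity(a, b))
```

Here "kinkakuji" and "rain" never appear in the same document, yet their contexts are identical, which is exactly the case this measure is meant to capture and the N1 / N2 ratio would miss.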

Also, the presence / absence of relevance in a dictionary describing the relevance of words may be used as the relevance level representing the relevance between the topic word of interest and the background topic word.

For example, when a tree-structured thesaurus representing the hypernym-hyponym relations of words is available, the reciprocal of the distance between the nodes representing the two words in the thesaurus tree may be used as the degree of relation between the two words.
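A minimal sketch of the reciprocal tree distance follows. The thesaurus is represented here as a child-to-parent mapping, and distance is taken to be the number of edges on the path between the two nodes; both choices, the sample thesaurus, and the handling of identical words (infinite relevance) are assumptions, since the source does not specify them.

```python
def ancestors(parent, node):
    """Return the path from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(parent, a, b):
    """Number of edges between a and b via their lowest common ancestor."""
    depth_from_a = {n: i for i, n in enumerate(ancestors(parent, a))}
    for j, n in enumerate(ancestors(parent, b)):
        if n in depth_from_a:
            return depth_from_a[n] + j
    raise ValueError("words are not in the same tree")


def thesaurus_relevance(parent, a, b):
    """Reciprocal of the tree distance.

    Assumption (not from the source): identical words get infinite relevance.
    """
    d = tree_distance(parent, a, b)
    return 1.0 / d if d else float("inf")


# Hypothetical thesaurus: child -> parent (hypernym).
PARENT = {
    "heavy rain": "rain",
    "drizzle": "rain",
    "rain": "weather",
    "snow": "weather",
}
print(thesaurus_relevance(PARENT, "heavy rain", "drizzle"))
print(thesaurus_relevance(PARENT, "heavy rain", "snow"))
```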

Also, as the relevance level representing the relevance between the target document topic word and the background topic word, the temporal appearance closeness may be used.

For example, let Ta be the average time of creation or publication of a document in which word A appears, and Tb be the average time of creation or publication of a document in which word B appears. At this time, the reciprocal of the temporal distance between Ta and Tb may be used as the degree of association representing the relationship between two words.
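The temporal-closeness measure can be sketched as follows. Times are given as plain numbers of hours to keep the example simple; the function names, the `(hour, words)` document representation, and the handling of the edge cases (a word that never appears, or two identical average times) are assumptions not specified in the source.

```python
def mean_time(documents, word):
    """Average posting time of the documents containing `word`, or None."""
    times = [t for t, words in documents if word in words]
    return sum(times) / len(times) if times else None


def temporal_relevance(documents, word_a, word_b):
    """Reciprocal of |Ta - Tb|.

    Assumptions (not from the source): 0.0 when either word never
    appears, and infinity when the two average times coincide.
    """
    ta = mean_time(documents, word_a)
    tb = mean_time(documents, word_b)
    if ta is None or tb is None:
        return 0.0
    gap = abs(ta - tb)
    return 1.0 / gap if gap else float("inf")


# Hypothetical documents: (posting hour, set of words).
docs = [
    (14.0, {"rain"}),
    (16.0, {"rain"}),
    (17.0, {"kinkakuji"}),
    (19.0, {"kinkakuji"}),
]
print(temporal_relevance(docs, "rain", "kinkakuji"))
```

Note that the value depends on the time unit chosen, so it should be scaled consistently before being combined with other relevance values.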

Also, a value obtained by combining the above-mentioned various degrees of association may be used as the degree of association representing the relation between the target document topic word and the background topic word.

For example, when the relevance calculated using the co-occurrence strength of two words appearing in documents is V1, and the relevance calculated using the closeness of temporal appearance is V2, V1 + V2 may be output as the relevance instead of V1 and V2 individually.

Also, when calculating the degree of association representing the relationship between the target document topic word and the background topic word, a value representing how feature-word-like the background topic word is may be calculated, and that value may be taken into account in calculating the degree of association.

For example, let the magnitude of the appearance frequency in the reference document set be V3 as a value representing the likelihood of a feature word in the reference document set. It may be considered that the larger the value is, the more important the background topic word is, and the degree of association of the background topic word may be highly evaluated by adding V3 to the degree of association based on another method.
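The additive combination of V1, V2, and V3 amounts to the following sketch. The function name `combined_relevance` and the sample values are hypothetical.

```python
def combined_relevance(v1, v2, v3=0.0):
    """Combine individual relevance values by simple addition.

    v1: relevance based on co-occurrence strength
    v2: relevance based on closeness of temporal appearance
    v3: optional feature-word likelihood of the background topic word,
        for example its appearance frequency in the reference document set
    """
    return v1 + v2 + v3


# Hypothetical values.
print(combined_relevance(0.5, 0.25))       # V1 + V2
print(combined_relevance(0.5, 0.25, 2.0))  # V1 + V2 with V3 added
```

The source does not say how the values are scaled relative to one another; since a raw frequency such as V3 can dominate ratios such as V1, some normalization would probably be needed in practice.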

Besides the methods described above, there are other techniques generally known in the field of natural language processing for calculating the degree of association between words. In the present embodiment, a degree of relevance obtained by such a known technique may be used to calculate the relevance between the target document topic word and the background topic word.

FIG. 7 is a diagram illustrating an example of data output from the background topic word extraction unit 20.

In FIG. 7, the degree of relevance representing the relevance between the target document topic word and the background topic word is described. In FIG. 7, the vertical column represents the document topic word of interest, and the horizontal column represents the background topic word.

This example is based on the following assumptions. That is, a set of documents posted on a microblog from 16:00 to 20:00 is set as the target document set. A set of documents posted from 0 o'clock to 16 o'clock is set as the reference document set and is divided into documents posted in the four periods of “0 o'clock to 4 o'clock”, “4 o'clock to 8 o'clock”, “8 o'clock to 12 o'clock”, and “12 o'clock to 16 o'clock”, and the feature words of each document set are extracted as background topic words. Further, a relevance level representing the relevance between the target document topic word and the background topic word is calculated.

As shown in the example of FIG. 7, a high degree of relevance is calculated for background topic words, such as “heavy rain”, that represent a topic forming the background of the target document topic word. On the other hand, a low degree of relevance is calculated for background topic words, such as “electronic book” and “Democratic Party”, that do not represent a background topic for the target document topic word.

Next, the operation of step S4 will be specifically described. The representative character string extraction unit 30 adds the background topic word extracted by the background topic word extraction unit 20 and the calculated degree of relevance in addition to the target document topic word representing the topic of the target document set extracted by the target document topic word extraction unit 10. Is used to extract a representative character string representing the topic of the document set of interest.

Specifically, among the character strings included in the documents in the target document set, a summary score indicating the goodness of the character string as a summary sentence is assigned to each character string that includes any one of the target document topic words and also includes any one of the background topic words highly related to that target document topic word. Then, a character string having a high summary score is extracted as a representative character string representing the topic of the target document set.

The method of determining the character string to be extracted is arbitrary. For example, all the sentences included in the documents in the target document set can be obtained by dividing all the documents in the target document set with symbols representing sentence breaks such as punctuation marks.

The set of these sentences may be a character string to be extracted. Further, by dividing all documents in the target document set into every N characters (N is an integer of 2 or more), a set of character strings having a length of N characters can be obtained. A set of character strings having a length of N characters may be a character string to be extracted.
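The two ways of obtaining candidate character strings can be sketched as follows. The function names, the particular set of sentence-ending symbols, and the decision to keep a final chunk shorter than N characters are assumptions for illustration.

```python
import re

# Assumed sentence-ending symbols (ASCII and Japanese full-width forms).
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def split_sentences(text):
    """Split text at sentence-ending punctuation, keeping the punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def split_fixed_length(text, n):
    """Split text into consecutive chunks of n characters (n >= 2).

    Assumption (not from the source): a final chunk shorter than n
    characters is kept rather than discarded.
    """
    if n < 2:
        raise ValueError("n must be an integer of 2 or more")
    return [text[i:i + n] for i in range(0, len(text), n)]


print(split_sentences("Kinkakuji is flooded due to heavy rain. I was surprised!"))
print(split_fixed_length("abcdefg", 3))
```

Sentence units tend to give readable summaries, while fixed-length units do not depend on punctuation, which can be unreliable in microblog text.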

As a method for calculating the summary score of a character string, for example, only character strings including any one of the target document topic words are selected, and for each of the background topic words included in a selected character string, its degree of relevance to the target document topic word is obtained; the sum of these degrees of relevance may be used as the summary score. In addition, a method for selecting a summary character string from feature words as described in Non-Patent Document 3 may be used.
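The sum-of-relevance scoring can be sketched as below. The function names, substring matching as the test for "includes", the summation over every target document topic word contained in the string, and the relevance values are all assumptions; the actual values of FIG. 7 and FIG. 8 are not reproduced here.

```python
def summary_score(text, target_words, relevance):
    """Return the summary score of `text`, or None if it is not a candidate.

    relevance maps (target_word, background_word) -> degree of relevance.
    Assumption (not from the source): when the string contains several
    target document topic words, the relevance to each of them is summed.
    """
    targets = [t for t in target_words if t in text]
    if not targets:
        return None  # no target document topic word: no score is given
    background_words = {b for _, b in relevance}
    score = 0.0
    for b in background_words:
        if b in text:
            score += sum(relevance.get((t, b), 0.0) for t in targets)
    return score


def representative_string(candidates, target_words, relevance):
    """Return the candidate with the highest summary score, or None."""
    scored = [(summary_score(s, target_words, relevance), s) for s in candidates]
    scored = [(score, s) for score, s in scored if score is not None]
    return max(scored, key=lambda pair: pair[0])[1] if scored else None


# Hypothetical relevance values.
RELEVANCE = {
    ("Kinkakuji", "heavy rain"): 0.75,
    ("Kinkakuji", "electronic book"): 0.125,
}
CANDIDATES = [
    "Kinkakuji is supposed to be dangerous",
    "Kinkakuji is flooded due to heavy rain",
    "I was surprised by the heavy rain",
]
print(representative_string(CANDIDATES, ["Kinkakuji"], RELEVANCE))
```

A string with a target document topic word but no background topic word receives a score of 0.0, whereas a string without any target document topic word is excluded altogether.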

FIG. 8 is a diagram illustrating an example of a summary score of a character string in the representative character string extraction unit 30. FIG. 8 shows the summary score of the character strings included in the documents in the target document set when the documents in the period of “16: 00-20: 00” are set as the target document set.

The first column in FIG. 8 is a character string included in the documents in the target document set. The second column is a document topic word of interest included in the character string. The third column is a background topic word included in the character string and its degree of association. The fourth column is a summary score of the character string calculated based on the third column.

In FIG. 8, the character string “Kinkakuji is flooded due to heavy rain” has the highest summary score. This is because it includes the background topic word “heavy rain”, which is highly relevant to the target document topic word. Such a sentence is considered to be a summary sentence that includes an explanation of the background topic.

On the other hand, the character string “Kinkakuji is supposed to be dangerous” includes two topic words of interest, but does not include background topic words, so the summary score of the character string is low. Such a character string is considered to be a summary sentence that does not include an explanation of the background topic.

Meanwhile, the character string “I was surprised by the heavy rain” includes the background topic word “heavy rain”, but no summary score is given to it. This is because a character string that does not include a target document topic word is considered unsuitable as a summary of the topic in the target period, even if it includes a background topic word.

As a result, the character string “Kinkakuji is flooded due to heavy rain” is selected as the representative character string when the documents in the period of “16: 00-20: 00” are set as the target document set.

FIG. 9 is a diagram illustrating an example of data output by the representative character string extraction unit 30. In this example, a representative character string is displayed when a document in a period from 16:00 to 20:00 is set as a target document set.

In FIG. 9, the representative character string includes a related background topic word “heavy rain”. Thereby, compared with the example shown in FIG. 2, the sentence including the explanation of the topic as a background is output. Moreover, the topic of the target document set is summarized by including the target document topic word “Kinkakuji”.

As described above, according to the time-series document summarizing apparatus 201 according to the present embodiment, a summary sentence that summarizes the topics of a certain period and also explains the background topic can be output from a large number of documents having time information.

Incidentally, when a conventional technique is used, such as selecting sentences containing feature words as summary sentences based on the statistical appearance tendency of words or expressions, a sentence that does not include a part explaining the background is, probabilistically, likely to be selected as the summary sentence. However, for general readers who do not know the background, such a sentence is not appropriate as a summary sentence because they cannot understand what it is about.

On the other hand, in the time-series document summarization apparatus according to the embodiment of the present invention, the background topic word extraction unit 20 includes a target document set, a set of target document topic words that are feature words of the target document set, and a target A reference document set, which is a document set different from the document set, is acquired, and background topic words representing topics serving as backgrounds of topics described in the document set of interest are extracted from the reference document set. Then, the representative character string extracting unit 30 extracts a representative character string including the target document topic word and the background topic word from the character strings included in the target document set as a summary sentence of the target document set.

Here, specific differences between the techniques described in Patent Documents 1 to 3 and the time-series document summarization apparatus according to the embodiment of the present invention include the following points.

That is, in the technique described in Patent Document 1, these topic words are combined when the degree of document sharing between the topic words is high. That is, topic words that frequently appear in the same document are combined. For this reason, since the target document set and the document set different from the target document set are not distinguished, the two types of the target document topic word and the background topic word cannot be distinguished and extracted.

In contrast, in the time-series document summarization apparatus according to the embodiment of the present invention, a document set different from the document set of interest is prepared and feature words are extracted, and the extracted feature words are used as background topic words. Then, a character string including two types of background topic word and target document topic word is extracted from the target document set.

In the technique described in Patent Document 2, the degree of association between transmission sources is calculated from the similarity of word groups included in documents created by each transmission source in the past. In the technique described in Patent Document 3, the appearance frequency of each word at each time is totaled, and only words whose appearance frequency greatly increases at any part of the period are extracted as potential topic candidate words. As described above, the techniques described in Patent Documents 2 and 3 are completely different from the configuration of the time-series document summarization device according to the embodiment of the present invention, which extracts, from the reference document set, background topic words representing a topic that is the background of the topic described in the target document set.

That is, in the time-series document summarization device according to the embodiment of the present invention, a character string that includes not only a feature word of the target document set, that is, a target document topic word, but also a word representing the background topic, that is, a background topic word, is extracted as a representative character string from the character strings included in the target document set. More specifically, a document set different from the target document set is prepared, feature words of this document set are extracted as background topic words, and a character string including both a background topic word and a target document topic word is extracted from the target document set.

That is, among the constituent elements of the time-series document summarizing apparatus according to the embodiment of the present invention, the minimum configuration including the background topic word extraction unit 20 and the representative character string extraction unit 30 makes it possible to achieve the object of the present invention, namely to output an appropriate summary sentence from a set of documents.

In the time-series document summarization apparatus according to the embodiment of the present invention, the background topic word extraction unit 20 acquires, as the reference document set, a document set including documents created or published earlier than the target document set.

With such a configuration, it is possible to acquire a document set that is more likely to include a past topic than the topic of the document set of interest, and to acquire an appropriate background topic word.

Also, in the time-series document summarization apparatus according to the embodiment of the present invention, the background topic word extraction unit 20 extracts, as background topic words, words that appear many times in the reference document set or words whose appearance in it is biased.

With such a configuration, an appropriate background topic word can be more reliably acquired from the reference document set. That is, it is possible to acquire words related to contents that have been discussed to some extent in the past as background topic words.

Also, in the time-series document summarization apparatus according to the embodiment of the present invention, the background topic word extraction unit 20 calculates the degree of association between the target document topic word and the background topic word. Then, the representative character string extraction unit 30 calculates the score of each character string included in the target document set based on the relevance calculated by the background topic word extraction unit 20, and determines a character string having a high score as the representative character string.

With such a configuration, it is possible to quantitatively evaluate the character strings included in the target document set and extract an appropriate representative character string. That is, it is possible to acquire a word related to the content currently being discussed as a background topic word.

Further, in the time-series document summarization apparatus according to the embodiment of the present invention, the background topic word extraction unit 20 calculates the relevance based on the co-occurrence of the target document topic word and the background topic word within documents, or on the similarity of their co-occurring words, in at least one of the target document set and the reference document set.

With such a configuration, it is possible to appropriately calculate the score of the character string included in the target document set.

Further, in the time-series document summarization apparatus according to the embodiment of the present invention, the target document topic word extraction unit 10 acquires the target document set and extracts, as the target document topic word, a word that is included in the target document set and represents the topic of the target document set. Then, the background topic word extraction unit 20 acquires the target document topic word extracted by the target document topic word extraction unit 10.

With such a configuration, the target document set and the target document topic word can be automatically acquired, and the apparatus can function more comprehensively as a device for creating a summary sentence of the target document set.
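A minimal sketch of automatic target document topic word extraction is given below. Ranking by document frequency after removing stop words is an assumption made for illustration. The stop-word list, the parameter `top_k`, and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical stop-word list; a practical system would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "the", "is", "of", "to", "in", "and", "after"})


def extract_target_topic_words(target_docs, top_k=1):
    """Return the top_k words that appear in the most target documents."""
    df = Counter()
    for text in target_docs:
        df.update(set(text.lower().split()) - STOP_WORDS)
    # Sort by descending document frequency, then alphabetically so that
    # ties are resolved deterministically.
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]
```

A measure that compares the target period with earlier periods, such as a burst score, could replace plain document frequency here.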

Although the time-series document summarization apparatus according to the embodiment of the present invention is configured to include the target document topic word extraction unit 10, the present invention is not limited to this. The time-series document summarization apparatus 201 may be configured without the target document topic word extraction unit 10, in which case the background topic word extraction unit 20 acquires a pair of the target document set and the target document topic word from outside the time-series document summarization apparatus 201. For example, the time-series document summarization apparatus 201 may be configured to accept, from a user, designation of a pair of a target document set and a target document topic word.

Some or all of the above embodiments can be described as the following supplementary notes, but the scope of the present invention is not limited to the following supplementary notes.

[Appendix 1]
A time-series document summarization device for outputting a summary sentence of a target document set, which is a document set to be summarized, the device comprising:
a background topic word extraction unit that acquires a pair of the target document set and a target document topic word, which is a characteristic word of the target document set, together with a reference document set, which is a document set different from the target document set, and extracts, from the reference document set, a background topic word representing a topic that is the background of the topic described in the target document set; and
a representative character string extraction unit that extracts, from the character strings included in the target document set, a representative character string including the target document topic word and the background topic word as the summary sentence of the target document set.

[Appendix 2]
The time-series document summarization device according to appendix 1, wherein the background topic word extraction unit acquires, as the reference document set, a document set including documents created or released earlier than the target document set.

[Appendix 3]
The time-series document summarization device according to appendix 2, wherein the background topic word extraction unit extracts, as the background topic words, words included in the reference document set in large numbers or words included in a biased manner.

[Appendix 4]
The time-series document summarization device according to any one of appendices 1 to 3, wherein the background topic word extraction unit calculates a degree of relevance between the target document topic word and the background topic word, and
the representative character string extraction unit calculates a score of each character string included in the target document set based on the relevance calculated by the background topic word extraction unit, and sets a character string having a high score as the representative character string.

[Appendix 5]
The time-series document summarization device according to appendix 4, wherein the background topic word extraction unit calculates the relevance based on the co-occurrence, within documents, of the target document topic word and the background topic word, or on the similarity of their co-occurring words, in at least one of the target document set and the reference document set.

[Appendix 6]
The time-series document summarization device according to any one of appendices 1 to 5, further comprising
a target document topic word extraction unit that acquires the target document set and extracts, as the target document topic word, a word that is included in the target document set and represents the topic of the target document set,
wherein the background topic word extraction unit acquires the target document topic word extracted by the target document topic word extraction unit.

[Appendix 7]
A time-series document summarization method for outputting a summary sentence of a target document set, which is a document set to be summarized, the method comprising the steps of:
acquiring a pair of the target document set and a target document topic word, which is a characteristic word of the target document set, together with a reference document set, which is a document set different from the target document set, and extracting, from the reference document set, a background topic word representing a topic that is the background of the topic described in the target document set; and
extracting, from the character strings included in the target document set, a representative character string including the target document topic word and the background topic word as the summary sentence of the target document set.

[Appendix 8]
The time-series document summarization method according to appendix 7, wherein in the step of extracting the background topic word, a document set including documents created or released earlier than the target document set is acquired as the reference document set.

[Appendix 9]
The time-series document summarization method according to appendix 8, wherein in the step of extracting the background topic word, words included in the reference document set in large numbers or words included in a biased manner are extracted as the background topic words.

[Appendix 10]
The time-series document summarization method according to any one of appendices 7 to 9, wherein in the step of extracting the background topic word, a degree of relevance between the target document topic word and the background topic word is calculated, and
in the step of extracting the representative character string, a score of each character string included in the target document set is calculated based on the calculated relevance, and a character string having a high score is set as the representative character string.

[Appendix 11]
The time-series document summarization method according to appendix 10, wherein in the step of extracting the background topic word, the relevance is calculated based on the co-occurrence, within documents, of the target document topic word and the background topic word, or on the similarity of their co-occurring words, in the target document set or the reference document set.

[Appendix 12]
The time-series document summarization method according to any one of appendices 7 to 11, further comprising
a step of acquiring the target document set and extracting, as the target document topic word, a word that is included in the target document set and represents the topic of the target document set,
wherein in the step of extracting the background topic word, the extracted target document topic word is acquired.

[Appendix 13]
A computer-readable recording medium on which a time-series document summarization program is recorded, the program being used in a time-series document summarization device for outputting a summary sentence of a target document set, which is a document set to be summarized, and causing a computer to execute the steps of:
acquiring a pair of the target document set and a target document topic word, which is a characteristic word of the target document set, together with a reference document set, which is a document set different from the target document set, and extracting, from the reference document set, a background topic word representing a topic that is the background of the topic described in the target document set; and
extracting, from the character strings included in the target document set, a representative character string including the target document topic word and the background topic word as the summary sentence of the target document set.

[Appendix 14]
The computer-readable recording medium according to appendix 13, wherein in the step of extracting the background topic word, a document set including documents created or released earlier than the target document set is acquired as the reference document set.

[Appendix 15]
The computer-readable recording medium according to appendix 14, wherein in the step of extracting the background topic word, words included in the reference document set in large numbers or words included in a biased manner are extracted as the background topic words.

[Appendix 16]
The computer-readable recording medium according to any one of appendices 13 to 15, wherein in the step of extracting the background topic word, a degree of relevance between the target document topic word and the background topic word is calculated, and
in the step of extracting the representative character string, a score of each character string included in the target document set is calculated based on the calculated relevance, and a character string having a high score is set as the representative character string.

[Appendix 17]
The computer-readable recording medium according to appendix 16, wherein in the step of extracting the background topic word, the relevance is calculated based on the co-occurrence, within documents, of the target document topic word and the background topic word, or on the similarity of their co-occurring words, in the target document set or the reference document set.

[Appendix 18]
The computer-readable recording medium according to any one of appendices 13 to 17, wherein the time-series document summarization program further causes the computer to execute
a step of acquiring the target document set and extracting, as the target document topic word, a word that is included in the target document set and represents the topic of the target document set, and
in the step of extracting the background topic word, the extracted target document topic word is acquired.

It should be considered that the above embodiment is illustrative in all points and not restrictive. The scope of the present invention is defined by the terms of the claims, rather than the description above, and is intended to include any modifications within the scope and meaning equivalent to the terms of the claims.

This application claims priority based on Japanese Patent Application No. 2011-29705 filed on February 15, 2011, the entire disclosure of which is incorporated herein.

According to the present invention, for example, in a microblog, it is possible to output a summary sentence that summarizes a topic of a certain period from a large amount of documents having time information and includes an explanation of a background topic. Therefore, the present invention has industrial applicability.

DESCRIPTION OF SYMBOLS
10 Target document topic word extraction unit
20 Background topic word extraction unit
30 Representative character string extraction unit
101 CPU
102 Main memory
103 Hard disk
104 Input interface
105 Display controller
106 Data reader/writer
107 Communication interface
108 Keyboard
109 Mouse
110 Display
111 Recording medium
121 Bus
201 Time-series document summarization device

Claims

1. A time-series document summarization device for outputting a summary sentence of a target document set, which is a document set to be summarized, the device comprising:
a background topic word extraction unit that acquires a pair of the target document set and a target document topic word, which is a characteristic word of the target document set, together with a reference document set, which is a document set different from the target document set, and extracts, from the reference document set, a background topic word representing a topic that is the background of the topic described in the target document set; and
a representative character string extraction unit that extracts, from the character strings included in the target document set, a representative character string including the target document topic word and the background topic word as the summary sentence of the target document set.
2. The time-series document summarization device according to claim 1, wherein the background topic word extraction unit acquires, as the reference document set, a document set including documents created or released earlier than the target document set.
3. The time-series document summarization device according to claim 2, wherein the background topic word extraction unit extracts, as the background topic words, words included in the reference document set in large numbers or words included in a biased manner.
4. The time-series document summarization device according to any one of claims 1 to 3, wherein the background topic word extraction unit calculates a degree of relevance between the target document topic word and the background topic word, and
the representative character string extraction unit calculates a score of each character string included in the target document set based on the relevance calculated by the background topic word extraction unit, and sets a character string having a high score as the representative character string.
5. The time-series document summarization device according to claim 4, wherein the background topic word extraction unit calculates the relevance based on the co-occurrence, within documents, of the target document topic word and the background topic word, or on the similarity of their co-occurring words, in at least one of the target document set and the reference document set.
6. The time-series document summarization device according to any one of claims 1 to 5, further comprising
a target document topic word extraction unit that acquires the target document set and extracts, as the target document topic word, a word that is included in the target document set and represents the topic of the target document set,
wherein the background topic word extraction unit acquires the target document topic word extracted by the target document topic word extraction unit.
7. A time-series document summarization method for outputting a summary sentence of a target document set, which is a document set to be summarized, the method comprising the steps of:
acquiring a pair of the target document set and a target document topic word, which is a characteristic word of the target document set, together with a reference document set, which is a document set different from the target document set, and extracting, from the reference document set, a background topic word representing a topic that is the background of the topic described in the target document set; and
extracting, from the character strings included in the target document set, a representative character string including the target document topic word and the background topic word as the summary sentence of the target document set.
8. A computer-readable recording medium on which a time-series document summarization program is recorded, the program being used in a time-series document summarization device for outputting a summary sentence of a target document set, which is a document set to be summarized, and causing a computer to execute the steps of:
acquiring a pair of the target document set and a target document topic word, which is a characteristic word of the target document set, together with a reference document set, which is a document set different from the target document set, and extracting, from the reference document set, a background topic word representing a topic that is the background of the topic described in the target document set; and
extracting, from the character strings included in the target document set, a representative character string including the target document topic word and the background topic word as the summary sentence of the target document set.