WO2012111226A1 - 時系列文書要約装置、時系列文書要約方法およびコンピュータ読み取り可能な記録媒体 - Google Patents
- Publication number: WO2012111226A1 (PCT application PCT/JP2011/078517)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- document
- topic
- document set
- target document
- topic word
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
Definitions
- The present invention relates to a time-series document summarization apparatus, a time-series document summarization method, and a computer-readable recording medium, and more particularly to an apparatus, method, and recording medium for summarizing the topics in a document set and presenting them to a user.
- Trend analysis technology is known as a technology for extracting and summarizing matters that have become a hot topic from a large amount of time-series documents.
- Trend analysis is a technology that analyzes, from a large number of documents generated in time series such as news articles and blog articles, what is being talked about in each period, and presents the result to the user.
- In the technique described in Okumura Manabu, Minano Yasuyuki, Fujiki Yasuaki, Suzuki Yasuhiro, "Text Mining Based on Automatic Collection and Monitoring of Blog Pages", Japanese Society for Artificial Intelligence SIG-SW&ONT-A401-01, 2004 (Non-Patent Document 1), feature words that appear frequently in a specific period are extracted by determining whether the appearance interval of documents containing a given word is shorter than usual.
- For a feature word of the target period extracted using the technique described in Non-Patent Document 1, a sentence including that feature word can be output as a summary sentence representing the topic in that period.
- In "Yahoo! Blog Search", [online], [retrieved August 23, 2010], Internet <URL: http://blog-search.yahoo.co.jp/> (Non-Patent Document 2), a feature word at the current time is displayed on the top page; when the displayed feature word is clicked, the page transitions to a search page and part of a sentence including the clicked feature word is displayed. This is equivalent to presenting to the user a sentence including a feature word of the period of interest as a sentence explaining the topic in that period.
- Non-Patent Document 3 describes a technique for creating a summary by extracting sentences that include the feature words of documents. By applying this technique to a set of documents belonging to a certain period, a summary sentence explaining the topic of that period can be presented.
- Patent Document 1 discloses the following technique. When a topic word and document information related to the topic word are read, the degree of document sharing between documents related to one topic word and documents related to another topic word is calculated according to the topic word combination rule stored in the topic word combination storage unit. Next, topic words that can be combined are selected based on the document sharing degree, and the selected topic words are combined to form a topic word group together with the sharing degree. Then, based on the representative word extraction rule, representative words of the combined topic word groups are extracted.
- Patent Document 2 discloses the following technique. The distribution of relevance between the source of the document to be processed and the sources that have used a given word, obtained and totaled from a relevance database, is contrasted with the distribution of relevance between the source of the document to be processed and other sources, likewise obtained and totaled from the relevance database. An amount representing the degree to which the word is used by many sources highly related to the source of the document to be processed is then set as the topic level of the word.
- Patent Document 3 discloses the following technique. A time-series frequency vector of each word is generated by counting the temporal change in appearance frequency of words appearing in a plurality of document sets. The generated time-series frequency vectors are analyzed, and words whose frequency increases rapidly are extracted as candidate words for potential topics. For topics whose number of documents exceeds a predetermined threshold among the topics included in the document set, a main-topic time-series frequency vector is generated by counting the number of documents acquired at each time. Then, the inter-vector distance between the time-series frequency vector of each candidate word and the main-topic time-series frequency vector is calculated, and words with a large distance are extracted as latent topic words.
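The frequency-vector step of this prior-art technique can be sketched as follows. The function names, the tokenization, and the "latest period exceeds a ratio of the past average" burst criterion are illustrative assumptions, not details taken from Patent Document 3:

```python
def time_series_frequency(docs_by_period, word):
    """Count how often `word` appears in each period's document set.

    docs_by_period: list of periods, each a list of document strings.
    """
    return [sum(doc.split().count(word) for doc in docs)
            for docs in docs_by_period]

def burst_candidates(docs_by_period, vocabulary, ratio=3.0):
    """Extract words whose frequency in the latest period rises sharply
    above their average over earlier periods (candidate topic words).
    The ratio threshold is an illustrative assumption."""
    candidates = []
    for word in vocabulary:
        freq = time_series_frequency(docs_by_period, word)
        past, latest = freq[:-1], freq[-1]
        baseline = sum(past) / len(past) if past else 0.0
        if latest > ratio * max(baseline, 1.0):
            candidates.append(word)
    return candidates
```

With three periods where "rain" suddenly appears four times in the last one, only "rain" is extracted as a candidate.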
- Meanwhile, a new type of service called microblogging, such as Twitter, has begun to spread. On such microblogs, users often post sentences assuming a small number of readers who share specific background information.
- As a result, sentences that do not include a part explaining the background are statistically likely to be selected as summary sentences. However, for general readers who do not know the original background, such a sentence is not appropriate as a summary, because they cannot understand what it is written about.
- Non-Patent Documents 1 to 3 and Patent Documents 1 to 3 do not disclose a configuration for solving such a problem.
- The present invention has been made to solve the above-described problems, and an object of the present invention is to provide a time-series document summarization apparatus, a time-series document summarization method, and a computer-readable recording medium that can output an appropriate summary sentence from a set of documents.
- A time-series document summarization apparatus according to an aspect of the present invention is an apparatus for outputting a summary sentence of a target document set. The apparatus includes: a background topic word extraction unit that acquires the target document set, a set of target document topic words that are characteristic words of the target document set, and a reference document set that is a document set different from the target document set, and extracts from the reference document set a background topic word representing a topic that is the background of a topic described in the target document set; and a representative character string extraction unit that extracts, from the character strings included in the target document set, a representative character string including a target document topic word and a background topic word as a summary sentence of the target document set.
- A time-series document summarization method according to an aspect of the present invention is a method for outputting a summary sentence of a target document set. The method includes: a step of acquiring the target document set, a set of target document topic words that are characteristic words of the target document set, and a reference document set that is a document set different from the target document set, and extracting from the reference document set a background topic word representing a topic that is the background of a topic described in the target document set; and a step of extracting, from the character strings included in the target document set, a representative character string including a target document topic word and a background topic word as a summary sentence of the target document set.
- A computer-readable recording medium according to an aspect of the present invention stores a program used in a time-series document summarization apparatus for outputting a summary sentence of a target document set, the program causing a computer to execute the above steps, including the step of extracting the representative character string as a summary sentence of the target document set.
- an appropriate summary sentence can be output from a set of documents.
- The drawings include a schematic configuration diagram of the time-series document summarization apparatus according to an embodiment of the present invention, a block diagram showing the control structure provided by the time-series document summarization apparatus according to the first embodiment of the present invention, and a flowchart showing its operation procedure.
- Human sentences can be considered to consist of two parts: a part explaining the "background", indicating what the sentence describes, and a part explaining the "new information" that the writer wants to convey. This holds not only for written text but also for verbal utterances.
- Here, "background" refers to the prerequisite topics and described objects that are necessary for understanding the text.
- "New information" refers to the matters the author wants to assert through the text, such as descriptions of new facts, opinions, and impressions regarding the topic and subject matter explained as background.
- The term "new information" is used generically here: it refers to information that the author wants to convey to, or claim before, the reader, and is not necessarily limited to information completely unknown to the reader.
- The main part the writer wants to convey through the text is the explanation of the new information. Since the description of the background is not new information, it can be omitted when the information is transmitted to a specific partner who already shares the background.
- In contrast, a news article assumes an unspecified number of readers who do not necessarily share the background information, so a sentence such as "In the Japan vs. Denmark match of the soccer World Cup, Japan won 3 to 1" describes the new information after explaining the background.
- microblogging is a service that allows individuals to post their own texts, just like blogs. The user can post a short sentence of about 140 characters at the maximum. With microblogging, people can easily post what they thought of on the Internet in real time.
- Compared to an accumulation of conventional news articles and blogs, an accumulation of sentences posted on microblogs is thought to contain a large number of sentences intended for a specific small number of readers. In such sentences, the part describing the background is often omitted.
- For example, many sentences posted on microblogs convey only the current new information, such as "Oh, a shot!" and "Yes! A goal!", while omitting the explanation of the background.
- The contributors of these texts are posting for a small number of readers who share enough background to guess what they are writing about. In many cases, it is also assumed that the time at which the posted text is read does not deviate greatly from the time of posting.
- FIG. 1 is a diagram showing an example of a topic of a day in a microblog.
- FIG. 2 is a diagram illustrating the feature words of each period and a sentence including the feature words in the example of FIG.
- FIGS. 1 and 2 illustrate changes in topics in a set of documents posted during a day on a microblog.
- One day is divided into six periods every four hours, and for each period, one sentence summarizing topics included in documents posted in that period is output. Therefore, it is assumed that a total of six summary sentences are output per day.
- Fig. 1 shows the result of a human worker reading the posted documents and examining what became a topic. This day was a day when various parts of Japan were hit by heavy rain, and it can be seen that three time zones, 4:00-8:00, 12:00-16:00, and 16:00-20:00, were filled with topics related to the heavy rain.
- FIG. 2 shows the result of extracting feature words in each period and a sentence including the feature words for the same document set as FIG.
- However, the method shown in FIG. 2 cannot output a summary sentence that includes an explanation of the topic that forms the background of the heavy rain. This is because it considers only whether a sentence includes the feature words of the period of interest. It is therefore necessary to add a further condition so that the summary sentence includes an explanation of the background topic.
- The time-series document summarization apparatus according to the present embodiment uses the characteristic words of past periods as clues, in addition to those of the period of interest. As a result, from a large amount of documents having time information, it can output a summary sentence that both summarizes the topic of a certain period and includes an explanation of the topic forming its background.
- The time-series document summarization apparatus 201 typically has a computer with a general-purpose architecture as its basic structure, and provides the various functions described later by executing preinstalled programs. Generally, such a program is distributed stored in a recording medium such as a flexible disk or a CD-ROM (Compact Disk Read Only Memory), or distributed via a network or the like.
- In addition to the application for providing the functions according to the embodiment of the present invention, an OS (Operating System) for providing the basic functions of the computer may be installed.
- The program according to the embodiment of the present invention may execute processing by calling necessary modules, among the program modules provided as part of the OS, in a predetermined order and/or at predetermined timing. That is, the program itself need not include such modules, and the processing may be executed in cooperation with the OS. The program according to the embodiment of the present invention may therefore take a form that does not include the above-described modules.
- The program according to the embodiment of the present invention may also be provided incorporated in a part of another program, such as the OS. In this case as well, the program itself does not include the modules included in that other program, and the processing is executed in cooperation with the other program. That is, the program according to the embodiment of the present invention may take a form incorporated in such another program.
- Part or all of the functions provided by program execution may instead be implemented as a dedicated hardware circuit.
- FIG. 3 is a schematic configuration diagram of the time-series document summarizing apparatus according to the embodiment of the present invention.
- The time-series document summarization apparatus 201 is an information processing apparatus such as a portable information terminal, a personal computer, or a server, and includes a CPU (Central Processing Unit) 101 serving as an arithmetic processing unit, a main memory 102, a hard disk 103, an input interface 104, a display controller 105, a data reader/writer 106, and a communication interface 107. These units are connected to each other via a bus 121 so that data communication is possible.
- the CPU 101 performs various operations by developing programs (codes) stored in the hard disk 103 in the main memory 102 and executing them in a predetermined order.
- The main memory 102 is typically a volatile storage device such as a DRAM (Dynamic Random Access Memory), and stores, in addition to the program read from the hard disk 103, data representing various arithmetic processing results.
- the hard disk 103 is a non-volatile magnetic storage device, and stores various setting values in addition to programs executed by the CPU 101.
- The program installed in the hard disk 103 is distributed stored in a recording medium 111, as will be described later. Instead of the hard disk 103, a semiconductor storage device such as a flash memory may be employed.
- the input interface 104 mediates data transmission between the CPU 101 and an input unit such as a keyboard 108, a mouse 109, and a touch panel (not shown). That is, the input interface 104 receives an external input such as an operation command given by the user operating the input unit.
- the display controller 105 is connected to a display 110 that is a typical example of a display unit, and controls display on the display 110. That is, the display controller 105 displays the result of image processing by the CPU 101 to the user.
- the display 110 is, for example, an LCD (Liquid Crystal Display) or a CRT (Cathode Ray Tube).
- the data reader / writer 106 mediates data transmission between the CPU 101 and the recording medium 111. That is, the recording medium 111 circulates in a state where a program executed by the time-series document summarizing apparatus 201 is stored, and the data reader / writer 106 reads the program from the recording medium 111. Further, the data reader / writer 106 writes the processing result in the time-series document summarizing apparatus 201 into the recording medium 111 in response to the internal command of the CPU 101.
- The recording medium 111 may be, for example, a general-purpose semiconductor storage device such as CF (Compact Flash) or SD (Secure Digital), a magnetic storage medium such as a flexible disk, or a CD-ROM (Compact Disk Read Only Memory).
- the communication interface 107 mediates data transmission between the CPU 101, a personal computer, a server device, and the like.
- The communication interface 107 typically has an Ethernet (registered trademark) or USB (Universal Serial Bus) communication function.
- time series document summarization apparatus 201 may be connected to another output device such as a printer as necessary.
- FIG. 4 is a block diagram showing a control structure provided by the time-series document summarizing apparatus according to the first embodiment of the present invention.
- Each module shown in FIG. 4 is provided by developing a program (code) stored in the hard disk 103 into the main memory 102 and causing the CPU 101 to execute it. Note that some or all of the modules shown in FIG. 4 may be provided by firmware implemented in the hardware. Alternatively, part or all of the control structure shown in FIG. 4 may be realized by dedicated hardware and/or wired circuitry.
- the time-series document summarizing apparatus 201 includes a target document topic word extraction unit 10, a background topic word extraction unit 20, and a representative character string extraction unit 30 as its control structure.
- the time-series document summarizing apparatus 201 accepts a document set with time information as an input.
- a document set with time information is a set of documents in which documents included in the set are associated with some time.
- the time associated with each document represents the time when the document was created, the time when it was transmitted, and the like.
- the time may be described in any granularity such as year, month, day, hour, minute, and second.
- Examples of document sets with time information received as input by the time-series document summarization apparatus 201 include news articles, blogs, microblogs, and documents posted on electronic bulletin boards.
- the time series document summarization apparatus 201 summarizes the topics of the input document set. This input document set is called a target document set. That is, the time-series document summarization apparatus 201 creates a summary sentence of a target document set that is a target document set.
- the target document topic word extraction unit 10 sets the input document set with time information as the target document set. Then, the document-of-interest topic word extraction unit 10 extracts a feature word representing the topic of the document-of-interest collection as a document-of-interest topic word and outputs it.
- the background topic word extraction unit 20 sets a document set different from the target document set as a reference document set.
- Note that this reference document set is different from a document set serving as a dictionary, such as a term dictionary.
- the document set for reference may be a document set with time information or a document set without time information.
- The background topic word extraction unit 20 extracts, from the reference document set, feature words representing topics of periods earlier than the period of the target document set, as background topic words. Then, the background topic word extraction unit 20 calculates a relevance degree representing the relevance between each extracted background topic word and the target document topic words output from the target document topic word extraction unit 10, and outputs the calculated relevance degrees and the background topic words.
- The representative character string extraction unit 30 extracts a representative character string representing the topic of the target document set, using the background topic words extracted by the background topic word extraction unit 20 and the calculated relevance degrees, in addition to the target document topic words extracted by the target document topic word extraction unit 10.
- the document-of-interest topic word extraction unit 10 acquires the document-of-interest collection, and extracts a word representing the topic of the document-of-interest included in the document-of-interest collection as a document-of-interest topic word.
- The background topic word extraction unit 20 acquires the set of target document topic words, which are the characteristic words of the target document set extracted by the target document topic word extraction unit 10, and a reference document set, which is a document set different from the target document set.
- the background topic word extraction unit 20 acquires, as a reference document set, a document set including documents created or released in the past from the target document set.
- The background topic word extraction unit 20 extracts, from the reference document set, a background topic word representing a topic that is the background of the topic described in the target document set. For example, the background topic word extraction unit 20 extracts, as background topic words, words that appear frequently or in a biased manner in the reference document set.
- the representative character string extraction unit 30 extracts a representative character string including the target document topic word and the background topic word from the character strings included in the target document set as a summary sentence of the target document set.
- the background topic word extraction unit 20 calculates the degree of association between the target document topic word and the background topic word.
- For example, the background topic word extraction unit 20 calculates the relevance degree based on the co-occurrence of the target document topic word and the background topic word in documents, or on the similarity of their co-occurring words, in at least one of the target document set and the reference document set.
- For example, the representative character string extraction unit 30 calculates a score for each character string included in the target document set based on the relevance degree calculated by the background topic word extraction unit 20, and selects a character string with a high score as the representative character string.
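The overall pipeline of the three units can be sketched as follows. The frequency-based topic word extraction and the simple additive score are illustrative assumptions, since the patent leaves the concrete extraction and scoring methods open:

```python
from collections import Counter

def topic_words(docs, top_n=5):
    """Feature words of a document set: here simply the most frequent
    words (the patent does not fix the extraction method)."""
    counts = Counter(w for doc in docs for w in doc.split())
    return [w for w, _ in counts.most_common(top_n)]

def summarize(target_docs, reference_docs, top_n=5):
    """Pick the target-set sentence that best covers both the target
    topic words and the background topic words from the reference set."""
    target_topics = set(topic_words(target_docs, top_n))
    background_topics = set(topic_words(reference_docs, top_n))

    def score(sentence):
        words = set(sentence.split())
        return len(words & target_topics) + len(words & background_topics)

    return max(target_docs, key=score)
```

Given posts about heavy rain in the reference period, a target sentence mentioning the rain outscores an unrelated one and is returned as the summary.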
- FIG. 5 is a flowchart showing an operation procedure when the time-series document summarization apparatus according to the embodiment of the present invention performs time-series document summarization processing.
- the document-of-interest topic word extraction unit 10 receives an input of a document set with time information from the user (step S1).
- the target document topic word extraction unit 10 sets the input document set with time information as the target document set. Then, the document-of-interest topic word extraction unit 10 extracts a feature word representing the topic of the document-of-interest collection as a document-of-interest topic word and outputs it (step S2).
- the background topic word extraction unit 20 sets a document set different from the target document set as a reference document set.
- the background topic word extraction unit 20 extracts, from the reference document set, a feature word representing a topic in a period before the target document set period as a background topic word.
- Then, the background topic word extraction unit 20 calculates a relevance degree representing the relevance between the target document topic words output from the target document topic word extraction unit 10 and each background topic word, and outputs the relevance degrees and the background topic words (step S3).
- Next, the representative character string extraction unit 30 extracts a representative character string representing the topic of the target document set, using the background topic words extracted by the background topic word extraction unit 20 and the calculated relevance degrees, in addition to the target document topic words extracted by the target document topic word extraction unit 10 (step S4).
- Next, the operation of step S1 will be described in detail.
- the user inputs a document set with time information to the target document topic word extraction unit 10 using the keyboard 108 or the like.
- the user may input the document set with time information to the target document topic word extraction unit 10 by an external computer connected to the time-series document summarizing apparatus 201 via the communication interface 107 and the network.
- the user may input a document set with time information by designating a data file storing the document set with time information.
- the target document topic word extraction unit 10 reads a document set with time information from a data file designated by the user.
- the document-of-interest topic word extraction unit 10 sets the input document set with time information as the document-of-interest collection. Then, the document-of-interest topic word extraction unit 10 extracts a feature word representing the topic of the document-of-interest collection as a document-of-interest topic word and outputs it.
- a feature word of a document may be extracted using the technique described in pages 22 to 23 of Non-Patent Document 3.
- FIG. 6 is a diagram illustrating an example of data output from the document-of-interest topic word extraction unit 10.
- a set of documents posted on a microblog from 16:00 to 20:00 is used as a target document set, and topic words included in this target document set are extracted.
- the background topic word extraction unit 20 sets a document set different from the target document set as a reference document set.
- The background topic word extraction unit 20 extracts, from the reference document set, feature words representing topics of periods earlier than the period of the target document set, as background topic words. Then, the background topic word extraction unit 20 calculates a relevance degree representing the relevance between the target document topic words output from the target document topic word extraction unit 10 and each background topic word, and outputs the relevance degrees and the background topic words.
- As the reference document set, a set of documents expected to include topics earlier than the topic of the target document set is used. As such a set, a set of documents created or published earlier than the target document set can be used.
- the input document set of interest is a set of documents posted from 16:00 to 20:00 on a microblog.
- a reference document set for example, a set of documents posted on the same microblog between 0 o'clock and 16 o'clock can be used.
- a document source different from the microblog to which the target document set belongs such as a news article and another blog, may be used.
- When another document source is used, it must be a document set expected to include topics from before the period to which the target document set belongs.
- As long as the reference document set is a set of documents expected to include topics earlier than the topic of the target document set, the period in which the reference document set was created or published may be far from, or may overlap with, the creation or publication period of the target document set.
- a reference document set a set of documents posted from 0 o'clock to 6 o'clock may be used, or a set of documents posted from 3 o'clock to 18 o'clock may be used.
- the background topic word extraction unit 20 extracts feature words representing topics in a period before the target document set period as background topic words from the reference document set.
- the same method as the target document topic word extraction unit 10 extracting the target document topic word from the target document set may be used, or a different method may be used.
- For example, by applying to the reference document set the same method by which the target document topic word extraction unit 10 extracts target document topic words from the target document set, feature words representing topics of periods earlier than the period of the target document set can be extracted as background topic words.
- Alternatively, the reference document set may be further divided into several periods, and the same method by which the target document topic word extraction unit 10 extracts target document topic words may be applied to each divided document set.
- Next, the background topic word extraction unit 20 calculates a relevance degree representing the relationship between the target document topic words output by the target document topic word extraction unit 10 and each background topic word.
- Various quantities can be considered as the relevance degree representing the relationship between a target document topic word and a background topic word. Below, examples of values usable as the relevance degree between a target document topic word A and a background topic word B will be described.
- For example, the strength of co-occurrence, that is, how strongly the two words tend to appear in the same document, may be used as the relevance degree between the target document topic word and the background topic word.
- Let N1 be the number of documents in which both word A and word B appear in a document set, and let N2 be the number of documents in which either word A or word B appears. Then N1/N2 can serve as a relevance degree between the two words: the larger the value, the more strongly the two words appear together.
- a method for counting the number of documents only the number of documents in the target document set may be counted, or the number of documents in the target document set and the reference document set may be combined. Although the accuracy is inferior to these, only the number of documents in the reference document set may be counted.
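The N1 / N2 co-occurrence measure above can be sketched as follows. This is a minimal illustration assuming documents are represented as lists of tokens; that representation is not specified by the text.

```python
def cooccurrence_relevance(docs, word_a, word_b):
    """Degree of association N1 / N2 between word_a and word_b.

    N1: number of documents containing both words.
    N2: number of documents containing either word.
    """
    n1 = sum(1 for doc in docs if word_a in doc and word_b in doc)
    n2 = sum(1 for doc in docs if word_a in doc or word_b in doc)
    return n1 / n2 if n2 else 0.0

# Counting over the target document set only (one of the options above):
docs = [["Kinkakuji", "heavy rain"], ["heavy rain"], ["Kinkakuji"], ["lunch"]]
print(cooccurrence_relevance(docs, "Kinkakuji", "heavy rain"))  # 1/3
```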
- Second, as the degree of association between the target document topic word and the background topic word,
- the similarity between the co-occurring words of the target document topic word and those of the background topic word may be used; specifically, the similarity between the context in which the target document topic word appears and the context in which the background topic word appears.
- Each context can be represented as a vector of length Nw,
- where Nw is the total number of distinct words.
- Each element of the vector holds the number of times the corresponding word co-occurs with word A or word B.
- The similarity between these two vectors may be used as the degree of association between the two words.
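The context-similarity measure can be sketched as follows, using sparse co-occurrence count vectors. Cosine similarity is one common choice here; the text does not fix a specific similarity function, so treat it as an assumption of this sketch.

```python
from collections import Counter

def context_vector(docs, word):
    """Count how often each other word appears in the same document as `word`."""
    vec = Counter()
    for doc in docs:
        if word in doc:
            for w in doc:
                if w != word:
                    vec[w] += 1
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (Counters)."""
    dot = sum(c * v[w] for w, c in u.items())
    norm_u = sum(c * c for c in u.values()) ** 0.5
    norm_v = sum(c * c for c in v.values()) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [["Kinkakuji", "flooded", "rain"], ["heavy rain", "flooded", "rain"]]
sim = cosine(context_vector(docs, "Kinkakuji"), context_vector(docs, "heavy rain"))
print(sim)  # close to 1.0: the two words share the same co-occurring words
```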
- Third, the presence or absence of a relationship in a dictionary describing relationships between words may be used as the degree of association between the target document topic word and the background topic word.
- For example, the reciprocal of the distance between the nodes representing the two words in a thesaurus tree structure may be used as the degree of association between the two words.
- Fourth, temporal closeness of appearance may be used.
- Let Ta be the average creation or publication time of documents in which word A appears,
- and let Tb be the average creation or publication time of documents in which word B appears.
- The reciprocal of the temporal distance between Ta and Tb may then be used as the degree of association between the two words.
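The temporal-closeness measure can be sketched as follows. Times are given here as hours, and the +1 smoothing term (to keep the value finite when Ta equals Tb) is an added assumption not specified by the text.

```python
def temporal_relevance(docs, times, word_a, word_b):
    """Reciprocal of the temporal distance |Ta - Tb| between the average
    posting times of documents containing word_a and word_b."""
    def mean_time(word):
        ts = [t for doc, t in zip(docs, times) if word in doc]
        return sum(ts) / len(ts)
    ta, tb = mean_time(word_a), mean_time(word_b)
    return 1.0 / (abs(ta - tb) + 1.0)  # +1 avoids division by zero

docs = [["heavy rain"], ["heavy rain"], ["Kinkakuji"], ["election"]]
times = [14.0, 16.0, 17.0, 3.0]
# "heavy rain" (Ta = 15.0) appears close in time to "Kinkakuji" (Tb = 17.0),
# so that pair scores higher than the temporally distant "election" (3.0).
print(temporal_relevance(docs, times, "Kinkakuji", "heavy rain"))  # 1/3
```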
- A value combining the various degrees of association described above may also be used as the degree of association between the target document topic word and the background topic word.
- For example, letting V1 and V2 be values obtained by two of the methods above, V1 + V2 may be output as the degree of association.
- Further, a value representing how characteristic the background topic word is may be calculated and taken into account when calculating the degree of association.
- For example, let V3, the appearance frequency of the word in the reference document set, represent how characteristic the word is of the reference document set. The larger this value, the more important the background topic word may be considered, and its degree of association may be raised by adding V3 to a degree of association computed by another method.
- FIG. 7 is a diagram illustrating an example of data output from the background topic word extraction unit 20.
- FIG. 7 describes the degrees of association between target document topic words and background topic words.
- The rows correspond to target document topic words, and the columns to background topic words.
- This example is based on the following assumptions: a set of documents posted to a microblog from 16:00 to 20:00 is the target document set, and a set of documents posted from 0:00 to 16:00 is the reference document set. The reference document set is divided into documents posted in each of the four periods 0:00-4:00, 4:00-8:00, 8:00-12:00, and 12:00-16:00, and feature words of each divided document set are extracted as background topic words. Degrees of association between the target document topic words and the background topic words are then calculated.
- As shown, the degree of association with a background topic word that represents a background topic for a target document topic word is calculated to be high,
- while the degree of association with background topic words that do not represent such a background topic, such as "electronic book" and "Democratic Party", is calculated to be low.
- The representative character string extraction unit 30 extracts a representative character string representing the topic of the target document set, using not only the target document topic words extracted by the target document topic word extraction unit 10 but also the background topic words extracted by the background topic word extraction unit 20 and their calculated degrees of association.
- Specifically, among the character strings contained in documents of the target document set, strings that include at least one target document topic word and at least one background topic word highly associated with it
- are assigned a summary score indicating how good the string is as a summary sentence,
- and a string with a high summary score is extracted as the representative character string representing the topic of the target document set.
- The method of determining the candidate character strings is arbitrary.
- For example, all sentences contained in documents of the target document set can be obtained by splitting every document at symbols that mark sentence boundaries, such as punctuation marks,
- and this set of sentences may serve as the candidate character strings. Alternatively, splitting every document in the target document set into segments of N characters (N being an integer of 2 or more) yields a set of character strings of length N, which may serve as the candidate character strings.
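Both candidate-extraction options above can be sketched as follows. The sentence-boundary symbol set used here is an assumed example; the text only says "symbols representing sentence breaks such as punctuation marks".

```python
import re

# Assumed set of sentence-boundary symbols (Japanese and ASCII punctuation).
BOUNDARIES = r"[。．.!?！？]"

def candidate_strings(docs, n=None):
    """Split documents into candidate summary strings.

    With n=None, split at sentence-boundary symbols; otherwise split each
    document into fixed-length chunks of n characters (n >= 2).
    """
    candidates = []
    for doc in docs:
        if n is None:
            parts = re.split(BOUNDARIES, doc)
        else:
            parts = [doc[i:i + n] for i in range(0, len(doc), n)]
        candidates.extend(p for p in parts if p)
    return candidates

print(candidate_strings(["It rained hard. Kinkakuji flooded!"]))
print(candidate_strings(["abcde"], n=2))  # ['ab', 'cd', 'e']
```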
- As a method of calculating the summary score of a character string, for example, only character strings containing at least one target document topic word are selected, and, for each background topic word contained in a selected string,
- its degree of association with the target document topic word is summed; this sum may be used as the summary score.
- Alternatively, a method of selecting a summary character string based on feature words, such as that described in Non-Patent Document 3, may be used.
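The scoring rule described above (sum the degrees of association of the background topic words a string contains, but only for strings containing a target document topic word) can be sketched as follows. The data values mirror the FIG. 8 example but are illustrative; flattening the relevance table to a single mapping is a simplifying assumption.

```python
def summary_score(text, topic_words, relevance):
    """Summary score of a candidate string.

    topic_words: target document topic words.
    relevance:   mapping from background topic word to its degree of
                 association (flattened here for simplicity).
    Returns None when the string contains no target document topic word,
    since such strings are not candidates for the summary.
    """
    if not any(t in text for t in topic_words):
        return None
    return sum((v for bg_word, v in relevance.items() if bg_word in text), 0.0)

topic_words = ["Kinkakuji", "flooded"]
relevance = {"heavy rain": 0.9}  # illustrative degree of association

print(summary_score("Kinkakuji is flooded due to heavy rain", topic_words, relevance))  # 0.9
print(summary_score("Kinkakuji is supposed to be dangerous", topic_words, relevance))   # 0.0
print(summary_score("I was surprised by the heavy rain", topic_words, relevance))       # None
```

The three calls reproduce the three cases discussed for FIG. 8: background-explaining string, topic-only string, and background-only string.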
- FIG. 8 is a diagram illustrating an example of a summary score of a character string in the representative character string extraction unit 30.
- FIG. 8 shows the summary scores of character strings contained in documents of the target document set when the documents in the period 16:00-20:00 form the target document set.
- The first column in FIG. 8 lists character strings contained in documents of the target document set.
- The second column lists the target document topic words included in each string.
- The third column lists the background topic words included in each string together with their degrees of association.
- The fourth column lists the summary score of each string, calculated from the third column.
- The character string "Kinkakuji is flooded due to heavy rain" has the highest summary score, because it includes the background topic word "heavy rain", which is highly associated with a target document topic word. Such a sentence can be regarded as a summary that includes an explanation of the background topic.
- The character string "Kinkakuji is supposed to be dangerous" includes two target document topic words but no background topic word, so its summary score is low.
- Such a string is regarded as a summary sentence that lacks an explanation of the background topic.
- The character string "I was surprised by the heavy rain" includes the background topic word "heavy rain" but is given no summary score, because a string that contains no target document topic word is not considered suitable as a summary of the topics in the target period, even if it contains a background topic word.
- As a result, the character string "Kinkakuji is flooded due to heavy rain" is selected as the representative character string when the documents in the period 16:00-20:00 form the target document set.
- FIG. 9 is a diagram illustrating an example of data output by the representative character string extraction unit 30.
- The figure shows the representative character string when documents in the period from 16:00 to 20:00 form the target document set.
- Because the representative character string includes the related background topic word "heavy rain",
- a sentence that explains the background topic is output.
- At the same time, because it includes the target document topic word "Kinkakuji", it summarizes the topic of the target document set.
- As described above, the time-series document summarization apparatus 201 according to the present embodiment can summarize the topics of a given period from a large number of documents carrying time information and output a summary sentence that includes an explanation of the background topics.
- Specifically, the background topic word extraction unit 20 acquires the target document set, the set of target document topic words that are feature words of the target document set, and a reference document set that is a document set different from the target document set, and extracts from the reference document set background topic words representing topics that form the background of the topics described in the target document set.
- The representative character string extraction unit 30 then extracts, from the character strings contained in the target document set, a representative character string that includes both a target document topic word and a background topic word, as a summary sentence of the target document set.
- In other words, a document set different from the target document set is prepared and its feature words are extracted for use as background topic words, and a character string containing both kinds of words, a background topic word and a target document topic word, is extracted from the target document set.
- By contrast, in the technique of Patent Document 2, the degree of association between senders is calculated from the similarity of the word groups contained in documents each sender created in the past.
- In the technique of Patent Document 3, the appearance frequency of each word at each time is totaled, and only words whose appearance frequency increases sharply at some point in the period are extracted as latent topic candidate words.
- The techniques described in Patent Documents 2 and 3 are thus entirely different from the configuration of the time-series document summarization device according to the embodiment of the present invention, which extracts, from the reference document set, background topic words representing topics that form the background of the topics described in the target document set.
- Also, the time-series document summarization device extracts, from the character strings contained in the target document set, not only strings containing a feature word of the target document set (a target document topic word) but strings that additionally contain a word representing the background topic (a background topic word),
- and outputs them as representative character strings. More specifically, a document set different from the target document set is prepared, its feature words are extracted as background topic words, and a character string containing both a background topic word and a target document topic word is extracted from the target document set.
- In the time-series document summarization device according to the embodiment of the present invention, the minimum configuration consisting of the background topic word extraction unit 20 and the representative character string extraction unit 30 is sufficient to achieve the object of the present invention, namely outputting an appropriate summary sentence from a set of documents.
- Further, in the time-series document summarization device according to the embodiment of the present invention, the background topic word extraction unit 20 acquires, as the reference document set, a document set containing documents created or published earlier than those of the target document set.
- The background topic word extraction unit 20 then extracts, as background topic words, words that appear frequently in the reference document set or whose appearance is biased within it.
- With this configuration, appropriate background topic words can be acquired from the reference document set more reliably; that is, words related to content that was already discussed to some extent in the past can be acquired as background topic words.
- Further, the background topic word extraction unit 20 calculates the degree of association between target document topic words and background topic words, and the representative character string extraction unit 30 calculates the score of each character string contained in the target document set based on the degree of association calculated by the background topic word extraction unit 20, and selects a character string with a high score as the representative character string.
- The background topic word extraction unit 20 calculates the degree of association based on the in-document co-occurrence of the target document topic word and the background topic word, or on the similarity of their co-occurring words, in at least one of the target document set and the reference document set.
- Further, the target document topic word extraction unit 10 acquires the target document set and extracts words contained in it that represent the topics of the target document set as target document topic words. The background topic word extraction unit 20 then acquires the target document topic words extracted by the target document topic word extraction unit 10.
- With this configuration, the target document set and the target document topic words can be acquired automatically, and the apparatus functions more completely as a device for creating a summary sentence of the target document set.
- Although the time-series document summarization apparatus according to the embodiment is configured to include the target document topic word extraction unit 10, the present invention is not limited to this.
- The apparatus may instead omit the target document topic word extraction unit 10, with the background topic word extraction unit 20 acquiring the pair of the target document set and the target document topic words from outside the time-series document summarization apparatus 201.
- For example, the time-series document summarization apparatus 201 may accept designation of the pair of the target document set and the target document topic words from a user.
- (Appendix 1) A time-series document summarization device for outputting a summary sentence of a target document set, comprising:
- a background topic word extraction unit for acquiring the target document set, a set of target document topic words that are feature words of the target document set, and a reference document set that is a document set different from the target document set, and extracting, from the reference document set, background topic words representing topics that form the background of the topics described in the target document set; and
- a representative character string extraction unit for extracting, from the character strings contained in the target document set, a representative character string including a target document topic word and a background topic word, as a summary sentence of the target document set.
- (Appendix 2) The time-series document summarization device according to appendix 1, wherein the background topic word extraction unit acquires, as the reference document set, a document set containing documents created or published earlier than those of the target document set.
- (Appendix 3) The time-series document summarization device according to appendix 2, wherein the background topic word extraction unit extracts, as the background topic words, words that appear frequently in the reference document set or whose appearance is biased within it.
- (Appendix 4) The time-series document summarization device according to any one of appendices 1 to 3, wherein the background topic word extraction unit calculates a degree of association between the target document topic word and the background topic word, and the representative character string extraction unit calculates a score of each character string contained in the target document set based on the degree of association calculated by the background topic word extraction unit and takes a character string having a high score as the representative character string.
- (Appendix 5) The time-series document summarization device according to appendix 4, wherein the background topic word extraction unit calculates the degree of association based on the in-document co-occurrence of the target document topic word and the background topic word, or on the similarity of their co-occurring words, in at least one of the target document set and the reference document set.
- (Appendix 6) The time-series document summarization device according to any one of appendices 1 to 5, further comprising a target document topic word extraction unit for acquiring the target document set and extracting words contained in the target document set that represent the topics of the target document set as the target document topic words, wherein the background topic word extraction unit acquires the target document topic words extracted by the target document topic word extraction unit.
- (Appendix 8) The time-series document summarization method according to appendix 7, wherein, in the step of extracting the background topic words, a document set containing documents created or published earlier than those of the target document set is acquired as the reference document set.
- (Appendix 9) The time-series document summarization method according to appendix 8, wherein, in the step of extracting the background topic words, words that appear frequently in the reference document set or whose appearance is biased within it are extracted as the background topic words.
- (Appendix 12) The time-series document summarization method according to any one of appendices 7 to 11, further comprising a step of acquiring the target document set and extracting words contained in the target document set that represent the topics of the target document set as the target document topic words, wherein, in the step of extracting the background topic words, the extracted target document topic words are acquired.
- (Appendix 13) A computer-readable recording medium recording a time-series document summarization program, the program causing a computer to execute:
- a step of acquiring the target document set, a set of target document topic words that are feature words of the target document set, and a reference document set that is a document set different from the target document set, and extracting, from the reference document set, background topic words representing topics that form the background of the topics described in the target document set.
- (Appendix 14) The computer-readable recording medium according to appendix 13, wherein, in the step of extracting the background topic words, a document set containing documents created or published earlier than those of the target document set is acquired as the reference document set.
- (Appendix 15) The computer-readable recording medium according to appendix 14, wherein, in the step of extracting the background topic words, words that appear frequently in the reference document set or whose appearance is biased within it are extracted as the background topic words.
- The time-series document summarization program further causes the computer to execute a step of acquiring the target document set and extracting words contained in the target document set that represent the topics of the target document set as the target document topic words.
- According to the present invention, in a microblog for example, it is possible to output a summary sentence that summarizes the topics of a given period from a large number of documents carrying time information and that includes an explanation of the background topic. The present invention therefore has industrial applicability.
Description
FIG. 3 is a schematic configuration diagram of the time-series document summarization apparatus according to the embodiment of the present invention.
Next, the control structure for providing the various functions of the time-series document summarization apparatus 201 will be described.
Next, the operation of the time-series document summarization apparatus according to the embodiment of the present invention will be described with reference to the drawings. In the embodiment, the time-series document summarization method according to the embodiment is carried out by operating the time-series document summarization apparatus 201; the description of the method is therefore replaced by the following description of the operation of the apparatus 201. In the following description, FIG. 4 is referred to as appropriate.
(Appendix 1) A time-series document summarization device for outputting a summary sentence of a target document set, comprising:
a background topic word extraction unit for acquiring the target document set, a set of target document topic words that are feature words of the target document set, and a reference document set that is a document set different from the target document set, and extracting, from the reference document set, background topic words representing topics that form the background of the topics described in the target document set; and
a representative character string extraction unit for extracting, from the character strings contained in the target document set, a representative character string including a target document topic word and a background topic word, as a summary sentence of the target document set.
(Appendix 2) The time-series document summarization device according to appendix 1, wherein the background topic word extraction unit acquires, as the reference document set, a document set containing documents created or published earlier than those of the target document set.
(Appendix 3) The time-series document summarization device according to appendix 2, wherein the background topic word extraction unit extracts, as the background topic words, words that appear frequently in the reference document set or whose appearance is biased within it.
(Appendix 4) The time-series document summarization device according to any one of appendices 1 to 3, wherein the background topic word extraction unit calculates a degree of association between the target document topic word and the background topic word, and the representative character string extraction unit calculates a score of each character string contained in the target document set based on the degree of association calculated by the background topic word extraction unit and takes a character string having a high score as the representative character string.
(Appendix 5) The time-series document summarization device according to appendix 4, wherein the background topic word extraction unit calculates the degree of association based on the in-document co-occurrence of the target document topic word and the background topic word, or on the similarity of their co-occurring words, in at least one of the target document set and the reference document set.
(Appendix 6) The time-series document summarization device according to any one of appendices 1 to 5, further comprising a target document topic word extraction unit for acquiring the target document set and extracting words contained in the target document set that represent the topics of the target document set as the target document topic words, wherein the background topic word extraction unit acquires the target document topic words extracted by the target document topic word extraction unit.
(Appendix 7) A time-series document summarization method for outputting a summary sentence of a target document set, comprising:
a step of acquiring the target document set, a set of target document topic words that are feature words of the target document set, and a reference document set that is a document set different from the target document set, and extracting, from the reference document set, background topic words representing topics that form the background of the topics described in the target document set; and
a step of extracting, from the character strings contained in the target document set, a representative character string including a target document topic word and a background topic word, as a summary sentence of the target document set.
(Appendix 8) The time-series document summarization method according to appendix 7, wherein, in the step of extracting the background topic words, a document set containing documents created or published earlier than those of the target document set is acquired as the reference document set.
(Appendix 9) The time-series document summarization method according to appendix 8, wherein, in the step of extracting the background topic words, words that appear frequently in the reference document set or whose appearance is biased within it are extracted as the background topic words.
(Appendix 10) The time-series document summarization method according to any one of appendices 7 to 9, wherein, in the step of extracting the background topic words, a degree of association between the target document topic word and the background topic word is calculated, and, in the step of extracting the representative character string, a score of each character string contained in the target document set is calculated based on the calculated degree of association, and a character string having a high score is taken as the representative character string.
(Appendix 11) The time-series document summarization method according to appendix 10, wherein, in the step of extracting the background topic words, the degree of association is calculated based on the in-document co-occurrence of the target document topic word and the background topic word, or on the similarity of their co-occurring words, in the target document set or the reference document set.
(Appendix 12) The time-series document summarization method according to any one of appendices 7 to 11, further comprising a step of acquiring the target document set and extracting words contained in the target document set that represent the topics of the target document set as the target document topic words, wherein, in the step of extracting the background topic words, the extracted target document topic words are acquired.
(Appendix 13) A computer-readable recording medium recording a time-series document summarization program used in a time-series document summarization device for outputting a summary sentence of a target document set, the program causing a computer to execute:
a step of acquiring the target document set, a set of target document topic words that are feature words of the target document set, and a reference document set that is a document set different from the target document set, and extracting, from the reference document set, background topic words representing topics that form the background of the topics described in the target document set; and
a step of extracting, from the character strings contained in the target document set, a representative character string including a target document topic word and a background topic word, as a summary sentence of the target document set.
(Appendix 14) The computer-readable recording medium according to appendix 13, wherein, in the step of extracting the background topic words, a document set containing documents created or published earlier than those of the target document set is acquired as the reference document set.
(Appendix 15) The computer-readable recording medium according to appendix 14, wherein, in the step of extracting the background topic words, words that appear frequently in the reference document set or whose appearance is biased within it are extracted as the background topic words.
(Appendix 16) The computer-readable recording medium according to any one of appendices 13 to 15, wherein, in the step of extracting the background topic words, a degree of association between the target document topic word and the background topic word is calculated, and, in the step of extracting the representative character string, a score of each character string contained in the target document set is calculated based on the calculated degree of association, and a character string having a high score is taken as the representative character string.
(Appendix 17) The computer-readable recording medium according to appendix 16, wherein, in the step of extracting the background topic words, the degree of association is calculated based on the in-document co-occurrence of the target document topic word and the background topic word, or on the similarity of their co-occurring words, in the target document set or the reference document set.
(Appendix 18) The time-series document summarization program according to any one of appendices 13 to 17, further causing the computer to execute a step of acquiring the target document set and extracting words contained in the target document set that represent the topics of the target document set as the target document topic words, wherein, in the step of extracting the background topic words, the extracted target document topic words are acquired.
20 background topic word extraction unit
30 representative character string extraction unit
101 CPU
102 main memory
103 hard disk
104 input interface
105 display controller
106 data reader/writer
107 communication interface
108 keyboard
109 mouse
110 display
111 recording medium
121 bus
201 time-series document summarization apparatus
Claims (8)
- (Claim 1) A time-series document summarization device for outputting a summary sentence of a target document set, comprising: a background topic word extraction unit for acquiring the target document set, a set of target document topic words that are feature words of the target document set, and a reference document set that is a document set different from the target document set, and extracting, from the reference document set, background topic words representing topics that form the background of the topics described in the target document set; and a representative character string extraction unit for extracting, from the character strings contained in the target document set, a representative character string including a target document topic word and a background topic word, as a summary sentence of the target document set.
- (Claim 2) The time-series document summarization device according to claim 1, wherein the background topic word extraction unit acquires, as the reference document set, a document set containing documents created or published earlier than those of the target document set.
- (Claim 3) The time-series document summarization device according to claim 2, wherein the background topic word extraction unit extracts, as the background topic words, words that appear frequently in the reference document set or whose appearance is biased within it.
- (Claim 4) The time-series document summarization device according to any one of claims 1 to 3, wherein the background topic word extraction unit calculates a degree of association between the target document topic word and the background topic word, and the representative character string extraction unit calculates a score of each character string contained in the target document set based on the degree of association calculated by the background topic word extraction unit and takes a character string having a high score as the representative character string.
- (Claim 5) The time-series document summarization device according to claim 4, wherein the background topic word extraction unit calculates the degree of association based on the in-document co-occurrence of the target document topic word and the background topic word, or on the similarity of their co-occurring words, in at least one of the target document set and the reference document set.
- (Claim 6) The time-series document summarization device according to any one of claims 1 to 5, further comprising a target document topic word extraction unit for acquiring the target document set and extracting words contained in the target document set that represent the topics of the target document set as the target document topic words, wherein the background topic word extraction unit acquires the target document topic words extracted by the target document topic word extraction unit.
- (Claim 7) A time-series document summarization method for outputting a summary sentence of a target document set, comprising: a step of acquiring the target document set, a set of target document topic words that are feature words of the target document set, and a reference document set that is a document set different from the target document set, and extracting, from the reference document set, background topic words representing topics that form the background of the topics described in the target document set; and a step of extracting, from the character strings contained in the target document set, a representative character string including a target document topic word and a background topic word, as a summary sentence of the target document set.
- (Claim 8) A computer-readable recording medium recording a time-series document summarization program used in a time-series document summarization device for outputting a summary sentence of a target document set, the program causing a computer to execute: a step of acquiring the target document set, a set of target document topic words that are feature words of the target document set, and a reference document set that is a document set different from the target document set, and extracting, from the reference document set, background topic words representing topics that form the background of the topics described in the target document set; and a step of extracting, from the character strings contained in the target document set, a representative character string including a target document topic word and a background topic word, as a summary sentence of the target document set.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/982,523 US20130311471A1 (en) | 2011-02-15 | 2011-12-09 | Time-series document summarization device, time-series document summarization method and computer-readable recording medium |
JP2012557792A JP5884740B2 (ja) | 2011-02-15 | 2011-12-09 | 時系列文書要約装置、時系列文書要約方法および時系列文書要約プログラム |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2011029705 | 2011-02-15 | ||
JP2011-029705 | 2011-02-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2012111226A1 true WO2012111226A1 (ja) | 2012-08-23 |
Family
ID=46672175
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2011/078517 WO2012111226A1 (ja) | 2011-02-15 | 2011-12-09 | 時系列文書要約装置、時系列文書要約方法およびコンピュータ読み取り可能な記録媒体 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20130311471A1 (ja) |
JP (1) | JP5884740B2 (ja) |
WO (1) | WO2012111226A1 (ja) |
- Cited By (5)
- Publication number | Priority date | Publication date | Assignee | Title |
- ---|---|---|---|---|
- JP2015064650A (ja) * | 2013-09-24 | 2015-04-09 | ビッグローブ株式会社 | 情報処理装置、記事情報生成方法およびプログラム |
- JP2015169969A (ja) * | 2014-03-04 | 2015-09-28 | Nttコムオンライン・マーケティング・ソリューション株式会社 | 話題特定装置、および話題特定方法 |
- JP2019046016A (ja) * | 2017-08-31 | 2019-03-22 | ヤフー株式会社 | 算出装置、算出方法及び算出プログラム |
- JP7388617B2 (ja) | 2017-08-31 | 2023-11-29 | Lineヤフー株式会社 | 算出装置、算出方法及び算出プログラム |
- JP7553314B2 (ja) | 2020-10-13 | 2024-09-18 | 株式会社リクルート | 推定装置、推定方法及びプログラム |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9767165B1 (en) | 2016-07-11 | 2017-09-19 | Quid, Inc. | Summarizing collections of documents |
US10679002B2 (en) | 2017-04-13 | 2020-06-09 | International Business Machines Corporation | Text analysis of narrative documents |
EP3432155A1 (en) * | 2017-07-17 | 2019-01-23 | Siemens Aktiengesellschaft | Method and system for automatic discovery of topics and trends over time |
CN110727789A (zh) * | 2018-06-29 | 2020-01-24 | 微软技术许可有限责任公司 | 文档的概要生成 |
CN109117485B (zh) * | 2018-09-06 | 2023-08-08 | 北京汇钧科技有限公司 | 祝福语文本生成方法和装置、计算机可读存储介质 |
US11790184B2 (en) * | 2020-08-28 | 2023-10-17 | Salesforce.Com, Inc. | Systems and methods for scientific contribution summarization |
JP2024008334A (ja) | 2022-07-08 | 2024-01-19 | 株式会社東芝 | 情報処理装置、情報処理方法およびプログラム |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH10207891A (ja) * | 1997-01-17 | 1998-08-07 | Fujitsu Ltd | 文書要約装置およびその方法 |
JPH11219361A (ja) * | 1998-02-02 | 1999-08-10 | Fujitsu Ltd | 文書閲覧装置およびそのプログラムを格納した記憶媒体 |
JP2001084255A (ja) * | 1999-09-10 | 2001-03-30 | Fuji Xerox Co Ltd | 文書検索装置および方法 |
JP2002259371A (ja) * | 2001-03-02 | 2002-09-13 | Nippon Telegr & Teleph Corp <Ntt> | 文書要約方法および装置と文書要約プログラムおよび該プログラムを記録した記録媒体 |
Family Cites Families (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003141027A (ja) * | 2001-10-31 | 2003-05-16 | Toshiba Corp | 要約作成方法および要約作成支援装置およびプログラム |
GB2399427A (en) * | 2003-03-12 | 2004-09-15 | Canon Kk | Apparatus for and method of summarising text |
JP4333318B2 (ja) * | 2003-10-17 | 2009-09-16 | 日本電信電話株式会社 | 話題構造抽出装置及び話題構造抽出プログラム及び話題構造抽出プログラムを記録したコンピュータ読み取り可能な記憶媒体 |
US7480669B2 (en) * | 2005-02-15 | 2009-01-20 | Infomato | Crosslink data structure, crosslink database, and system and method of organizing and retrieving information |
US7577646B2 (en) * | 2005-05-02 | 2009-08-18 | Microsoft Corporation | Method for finding semantically related search engine queries |
US7702680B2 (en) * | 2006-11-02 | 2010-04-20 | Microsoft Corporation | Document summarization by maximizing informative content words |
WO2008083504A1 (en) * | 2007-01-10 | 2008-07-17 | Nick Koudas | Method and system for information discovery and text analysis |
US20080301120A1 (en) * | 2007-06-04 | 2008-12-04 | Precipia Systems Inc. | Method, apparatus and computer program for managing the processing of extracted data |
US8781989B2 (en) * | 2008-01-14 | 2014-07-15 | Aptima, Inc. | Method and system to predict a data value |
US8606810B2 (en) * | 2008-01-30 | 2013-12-10 | Nec Corporation | Information analyzing device, information analyzing method, information analyzing program, and search system |
WO2009096523A1 (ja) * | 2008-01-30 | 2009-08-06 | Nec Corporation | 情報分析装置、検索システム、情報分析方法及び情報分析用プログラム |
US20100185943A1 (en) * | 2009-01-21 | 2010-07-22 | Nec Laboratories America, Inc. | Comparative document summarization with discriminative sentence selection |
US8843476B1 (en) * | 2009-03-16 | 2014-09-23 | Guangsheng Zhang | System and methods for automated document topic discovery, browsable search and document categorization |
JP5879260B2 (ja) * | 2009-06-09 | 2016-03-08 | イービーエイチ エンタープライズィーズ インコーポレイテッド | マイクロブログメッセージの内容を分析する方法及び装置 |
US8533208B2 (en) * | 2009-09-28 | 2013-09-10 | Ebay Inc. | System and method for topic extraction and opinion mining |
JP5284990B2 (ja) * | 2010-01-08 | 2013-09-11 | インターナショナル・ビジネス・マシーンズ・コーポレーション | キーワードの時系列解析のための処理方法、並びにその処理システム及びコンピュータ・プログラム |
US8326880B2 (en) * | 2010-04-05 | 2012-12-04 | Microsoft Corporation | Summarizing streams of information |
US9286619B2 (en) * | 2010-12-27 | 2016-03-15 | Microsoft Technology Licensing, Llc | System and method for generating social summaries |
US8990065B2 (en) * | 2011-01-11 | 2015-03-24 | Microsoft Technology Licensing, Llc | Automatic story summarization from clustered messages |
- 2011-12-09: US application US13/982,523 filed (published as US20130311471A1, not active, abandoned)
- 2011-12-09: PCT application PCT/JP2011/078517 filed (active application filing)
- 2011-12-09: JP application JP2012557792 filed (granted as JP5884740B2, active)
Also Published As
Publication number | Publication date |
---|---|
JPWO2012111226A1 (ja) | 2014-07-03 |
JP5884740B2 (ja) | 2016-03-15 |
US20130311471A1 (en) | 2013-11-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5884740B2 (ja) | Time-series document summarization apparatus, time-series document summarization method, and time-series document summarization program | |
Nguyen et al. | Computational sociolinguistics: A survey | |
CN106649818B (zh) | Method and apparatus for recognizing application search intent, application search method, and server | |
Bansal et al. | On predicting elections with hybrid topic based sentiment analysis of tweets | |
Ruder et al. | Character-level and multi-channel convolutional neural networks for large-scale authorship attribution | |
JP5647508B2 (ja) | System and method for identifying the topic of short text communications | |
Mostafa | More than words: Social networks’ text mining for consumer brand sentiments | |
JP6629246B2 (ja) | Learning and using context-dependent content retrieval rules for query disambiguation | |
US8924491B2 (en) | Tracking message topics in an interactive messaging environment | |
US20200073485A1 (en) | Emoji prediction and visual sentiment analysis | |
US8782042B1 (en) | Method and system for identifying entities | |
Furini et al. | Sentiment analysis and twitter: a game proposal | |
CN110727785A (zh) | Recommendation model training, search text recommendation method, apparatus, and storage medium | |
US20210248687A1 (en) | System and method for predicting engagement on social media | |
US8290925B1 (en) | Locating product references in content pages | |
CN104881447A (zh) | Search method and apparatus | |
Hernandez et al. | Constructing consumer profiles from social media data | |
CN110430448B (zh) | Bullet-screen comment processing method, apparatus, and electronic device | |
Muralikumar et al. | A human-centered evaluation of a toxicity detection api: Testing transferability and unpacking latent attributes | |
Rahman et al. | Enhancing lecture video navigation with AI generated summaries | |
JPWO2016103519A1 (ja) | Data analysis system, data analysis method, and data analysis program | |
KR101105798B1 (ko) | Apparatus and method for keyword refinement, and content search system and method therefor | |
WO2016063403A1 (ja) | Data analysis system, data analysis method, and data analysis program | |
CN110659419A (zh) | Method for determining a target user and related apparatus | |
Cela et al. | Sexualization and Emotional Valence in Audience Reactions to Popular Music Video Through Automated Language Analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 11858890; Country of ref document: EP; Kind code of ref document: A1 |
ENP | Entry into the national phase | Ref document number: 2012557792; Country of ref document: JP; Kind code of ref document: A |
WWE | WIPO information: entry into national phase | Ref document number: 13982523; Country of ref document: US |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: PCT application non-entry in European phase | Ref document number: 11858890; Country of ref document: EP; Kind code of ref document: A1 |