WO2012008503A1

WO2012008503A1 - Passage extraction apparatus and method of passage extraction

Info

Publication number: WO2012008503A1
Application number: PCT/JP2011/066017
Authority: WO
Inventors: 辰則森; 英潔渋木; 正寛中野; 林太郎宮▲崎▼; 円香石下
Original assignee: 国立大学法人横浜国立大学
Priority date: 2010-07-13
Filing date: 2011-07-13
Publication date: 2012-01-19
Also published as: JPWO2012008503A1; JP5858407B2

Abstract

Disclosed are a passage extraction apparatus and related method that search the Web or a document database and extract a direct mediatory summary, that is, a passage that briefly explains both the situation in which a focus statement whose truth is to be judged holds and the situation in which an opposing statement holds. In particular, relevance to the focus statement, impartiality, and the density of characteristic words are taken into account. Documents are searched for with the focus statement as a condition and, in addition, with the opposing statement as a condition; the results are classified into a pure focus search document collection relating only to the focus statement and a pure opposing search document collection relating only to the opposing statement (S103); a word score is calculated (S104, S105) from the pure focus search document frequency and the pure opposing search document frequency; the characteristic words of the affirmative side and the characteristic words of the negative side are determined (S2801); a sentence score is obtained from the characteristic words (S3501); and a passage score is calculated (S107).

Description

Passage extraction apparatus and passage extraction method

The present invention relates to passage extraction that obtains a passage briefly explaining both the situation in which a statement of interest holds and the situation in which a conflicting statement holds.

Suppose that a user wants to judge whether a statement is true or false in light of the information available in the world. It is often observed that the statement is not simply true or false in every situation, but is true in one situation and false in another.

In such a case, a passage that clearly explains under what circumstances the content of the statement of interest holds, and under what other circumstances the content of the conflicting statement holds, would be useful for judging truth or falsity.

Suppose, for example, that you want to know the truth of the statement of interest “Diesel cars are good for the environment.” The conflict statement is “Diesel cars are bad for the environment.” In fact, either statement can be true depending on the situation.

For example, it is desirable to be able to automatically find a passage such as: “The evaluation has probably been divided depending on whether the focus was on CO2 (a greenhouse gas) or NOx (air pollution such as photochemical smog). Diesel was long said to emit little CO2 but large amounts of NOx and particulates; this has been dramatically improved by the development of catalysts and filters, and because of its good fuel efficiency, diesel has gained popularity in Europe.”

Non-Patent Document 1 discloses a method for determining whether or not two sentences are in conflict and a method for determining whether or not a certain sentence includes a conditional expression. Non-Patent Document 2 discloses a summarization technique for finding another sentence that briefly explains a plurality of sentences. However, even these known methods cannot achieve the above-mentioned object.

The problem to be solved is to realize a direct mediation summary, which obtains a passage that briefly explains both the situation in which the statement of interest, representing the matter whose truth the user wants to judge, holds and the situation in which the conflict statement, whose truth is opposite to that of the statement of interest, holds.

The passage extraction device according to the present invention is a passage extraction device that extracts, from search documents, a passage including affirmative content and negative content with respect to a statement of interest indicating a matter whose truth is to be judged, and has the following elements:
(1) A statement-of-interest input unit that inputs a statement of interest.
(2) A conflict statement specifying unit that identifies a conflict statement showing content opposite to the statement of interest.
(3) A statement-related document retrieval unit that retrieves documents based on the statement of interest and retrieves documents based on the conflict statement.
(4) A frequency calculation unit that obtains a net focus search document frequency by counting, for each word included in the net focus search documents (documents retrieved with the statement of interest and not retrieved with the conflict statement), the number of net focus search documents including the word, and obtains a pure conflict search document frequency by counting, for each word included in the pure conflict search documents (documents retrieved with the conflict statement and not retrieved with the statement of interest), the number of pure conflict search documents including the word; and/or obtains the frequency with which each word appears in the net focus search document set by counting the number of times the word appears in that set, and obtains the frequency with which each word appears in the pure conflict search document set by counting the number of times the word appears in that set.
(5) A word score calculation unit that calculates, for each word, a word score indicating the positive characteristic and the negative characteristic of the word with respect to the statement of interest, based on the net focus search document frequency and the pure conflict search document frequency of the word, and/or on the frequency with which the word appears in the net focus search document set and the frequency with which it appears in the pure conflict search document set.
(6) A passage score calculation unit that calculates, for each passage, a passage score indicating the degree of juxtaposition of both positive and negative characteristics of the passage with respect to the statement of interest, based on the word scores of the words included in the passage.
(7) A passage output unit that outputs passages based on the passage score.

Further, the passage extraction device further includes:
A feature word determination unit that determines, based on the word score, a positive feature word whose affirmative characteristic with respect to the statement of interest is higher than a predetermined criterion and a negative feature word whose negative characteristic with respect to the statement of interest is higher than a predetermined criterion; and
A sentence score calculation unit that, for each sentence included in the documents retrieved with the statement of interest and/or the documents retrieved with the conflict statement, obtains a sentence score by counting the number of positive feature words and negative feature words included in the sentence,
wherein the passage score calculation unit sets the maximum sentence score among the sentences included in the passage as the passage score.

In addition, the feature word determination unit determines a topic feature word, which is a content word of the statement of interest that is neither a positive feature word nor a negative feature word,
and the sentence score calculation unit counts the number of topic feature words included in the sentence and adds it to the sentence score.

When the passage includes a positive feature word, a negative feature word, and a topic feature word, the passage score calculation unit may multiply the maximum sentence score by a bonus coefficient larger than 1 to obtain the passage score.

Further, the sentence score calculation unit multiplies the sentence score by a low bonus coefficient larger than 1 when the passage includes either a positive feature word or a negative feature word together with a topic feature word, and multiplies the sentence score by a high bonus coefficient larger than the low bonus coefficient when it includes both a positive feature word and a negative feature word together with a topic feature word.

Furthermore, the device has a passage range setting unit that sets a passage range based on the sentence scores.

Further, the passage extraction device further includes:
A feature word determination unit that determines, based on the word score, a positive feature word whose affirmative characteristic with respect to the statement of interest is higher than a predetermined criterion and a negative feature word whose negative characteristic with respect to the statement of interest is higher than a predetermined criterion,
wherein the passage score calculation unit calculates a passage score based on the number of positive feature words and the number of negative feature words included in the passage.

In addition, the feature word determination unit determines a topic feature word that is not a positive feature word and a negative feature word and is a content word of the statement of interest,
The passage score calculation unit further calculates a passage score based on the number of topic feature words included in the passage.

Further, the passage extraction device further includes:
A feature word determination unit that determines, based on the word score, a positive feature word whose affirmative characteristic with respect to the statement of interest is higher than a predetermined criterion and a negative feature word whose negative characteristic with respect to the statement of interest is higher than a predetermined criterion,
wherein the passage score calculation unit obtains, for each positive feature word, the appearance frequency of that word by counting the number of sentences in the passage that include it, obtains, for each negative feature word, the appearance frequency of that word by counting the number of sentences in the passage that include it, and calculates a passage score based on the appearance frequencies of the positive feature words and of the negative feature words.

In addition, the feature word determination unit determines a topic feature word that is a content word of the statement of interest that does not correspond to the positive feature word and the negative feature word,
The passage score calculation unit further calculates, for each topic feature word, the number of sentences including the topic feature word among the sentences included in the passage, and obtains the appearance frequency of the topic feature word. A passage score is calculated based on the appearance frequency.

In addition, the feature word determination unit obtains, based on the word score, a ranking of words by affirmative characteristic with respect to the statement of interest and a ranking of words by negative characteristic with respect to the statement of interest, and determines the positive feature words and the negative feature words based on these two rankings.

The passage extraction method according to the present invention is a method performed by a passage extraction device that extracts, from search documents, a passage including affirmative content and negative content with respect to a statement of interest indicating a matter whose truth is to be judged, and has the following steps:
(1) A statement-of-interest input step of inputting a statement of interest.
(2) A conflict statement specifying step of identifying a conflict statement showing content opposite to the statement of interest.
(3) A statement-related document retrieval step of retrieving documents based on the statement of interest and retrieving documents based on the conflict statement.
(4) A frequency calculation step of obtaining a net focus search document frequency by counting, for each word included in the net focus search documents (documents retrieved with the statement of interest and not retrieved with the conflict statement), the number of net focus search documents including the word, and obtaining a pure conflict search document frequency by counting, for each word included in the pure conflict search documents (documents retrieved with the conflict statement and not retrieved with the statement of interest), the number of pure conflict search documents including the word; and/or obtaining the frequency with which each word appears in the net focus search document set by counting the number of times the word appears in that set, and obtaining the frequency with which each word appears in the pure conflict search document set by counting the number of times the word appears in that set.
(5) A word score calculation step of calculating, for each word, a word score indicating the positive characteristic and the negative characteristic of the word with respect to the statement of interest, based on the net focus search document frequency and the pure conflict search document frequency of the word, and/or on the frequency with which the word appears in the net focus search document set and the frequency with which it appears in the pure conflict search document set.
(6) A passage score calculation step of calculating, for each passage, a passage score indicating the degree of juxtaposition of both positive and negative characteristics of the passage with respect to the statement of interest, based on the word scores of the words included in the passage.
(7) A passage output step of outputting passages based on the passage score.

The program according to the present invention causes a computer serving as a passage extraction device, which extracts from search documents a passage including affirmative content and negative content with respect to a statement of interest indicating a matter whose truth is to be judged, to execute the following procedures:
(1) A statement-of-interest input procedure for inputting a statement of interest.
(2) A conflict statement specifying procedure for identifying a conflict statement showing content opposite to the statement of interest.
(3) A statement-related document retrieval procedure for retrieving documents based on the statement of interest and retrieving documents based on the conflict statement.
(4) A frequency calculation procedure for obtaining a net focus search document frequency by counting, for each word included in the net focus search documents (documents retrieved with the statement of interest and not retrieved with the conflict statement), the number of net focus search documents including the word, and obtaining a pure conflict search document frequency by counting, for each word included in the pure conflict search documents (documents retrieved with the conflict statement and not retrieved with the statement of interest), the number of pure conflict search documents including the word; and/or obtaining the frequency with which each word appears in the net focus search document set by counting the number of times the word appears in that set, and obtaining the frequency with which each word appears in the pure conflict search document set by counting the number of times the word appears in that set.
(5) A word score calculation procedure for calculating, for each word, a word score indicating the positive characteristic and the negative characteristic of the word with respect to the statement of interest, based on the net focus search document frequency and the pure conflict search document frequency of the word, and/or on the frequency with which the word appears in the net focus search document set and the frequency with which it appears in the pure conflict search document set.
(6) A passage score calculation procedure for calculating, for each passage, a passage score indicating the degree of juxtaposition of both positive and negative characteristics of the passage with respect to the statement of interest, based on the word scores of the words included in the passage.
(7) A passage output procedure for outputting passages based on the passage score.

It is possible to automatically generate a direct mediation summary that supports the user's judgment of the credibility of information on the Web. A direct mediation summary is a summary that, when two statements that appear to conflict can actually coexist, finds in Web documents a concise explanation of the situation in which they can coexist.

In particular, a direct mediation summary is generated based on relevance to the statement of interest, fairness, and feature word density. Relevance to the statement of interest can be approximated by whether the passage contains words from the statement of interest. Fairness refers to whether the opinions and grounds that affirm the statement of interest and those that deny it are mentioned equally; it can be approximated by whether the passage contains words characteristic of both sets of opinions and grounds. Passages with a high density of feature words often describe both sets of opinions and grounds in contrast and, being concise, are considered more appropriate as mediation summaries.

FIG. 1 is a diagram showing an overall processing flow.
FIG. 2 is a diagram illustrating a configuration relating to focus statement input and conflict statement specification in the passage extraction device.
FIG. 3 is a diagram showing a conflict statement specifying process flow.
FIG. 4 is a diagram showing a configuration relating to a statement-related document search in the passage extraction device.
FIG. 5 is a diagram showing a statement-related document search processing flow.
FIG. 6 is a diagram showing a configuration relating to document frequency calculation in the passage extraction device.
FIG. 7 is a diagram showing a document frequency calculation processing flow.
FIG. 8 is a diagram showing a pure focus search document frequency calculation processing flow.
FIG. 9 is a diagram showing a pure conflict search document frequency calculation processing flow.
FIG. 10 is a diagram showing a duplicate search document frequency calculation processing flow.
FIG. 11 is a diagram illustrating a configuration relating to word score calculation in the passage extraction device.
FIG. 12 is a diagram showing a word score calculation processing flow.
FIG. 13 is a diagram illustrating a configuration relating to passage range setting in the passage extraction device.
FIG. 14 is a diagram showing a passage range setting process flow.
FIG. 15 is a diagram illustrating a configuration relating to passage score calculation in the passage extraction device.
FIG. 16 is a diagram showing a passage score calculation processing flow.
FIG. 17 is a diagram illustrating a configuration relating to passage selection and passage output in the passage extraction device.
FIG. 18 is a diagram illustrating a configuration relating to word score calculation in the passage extraction device according to the second embodiment.
FIG. 19 is a diagram showing a word score calculation processing flow in the second embodiment.
FIG. 20 is a diagram showing a passage score calculation processing flow in the second embodiment.
FIG. 21 is a diagram showing a word score calculation processing flow in the third embodiment.
FIG. 22 is a diagram showing a word score calculation processing flow in the fourth embodiment.
FIG. 23 is a diagram showing a word score calculation processing flow in the fifth embodiment.
FIG. 24 is a diagram showing a passage score calculation processing flow in the fifth embodiment.
FIG. 25 is a diagram showing a word score calculation processing flow in the sixth embodiment.
FIG. 26 is a diagram showing a word score calculation processing flow in the seventh embodiment.
FIG. 27 is a diagram showing a word score calculation processing flow in the eighth embodiment.
FIG. 28 is a diagram showing an overall processing flow in the ninth embodiment.
FIG. 29 is a diagram illustrating a configuration relating to feature word determination in the passage extraction device.
FIG. 30 is a diagram showing a feature word determination processing flow (part 1).
FIG. 31 is a diagram showing a feature word determination processing flow (part 2).
FIG. 32 is a diagram showing a passage score calculation processing flow in the ninth embodiment.
FIG. 33 is a diagram showing a passage score calculation processing flow in the tenth embodiment.
FIG. 34 is a diagram illustrating a feature word determination processing flow in the eleventh embodiment.
FIG. 35 is a diagram showing an overall processing flow in the twelfth embodiment.
FIG. 36 is a diagram showing a sentence score calculation processing flow.
FIG. 37 is a diagram showing a passage score calculation processing flow in the twelfth embodiment.
FIG. 38 is a diagram showing a passage score calculation processing flow in the thirteenth embodiment.
FIG. 39 is a diagram showing a sentence score calculation processing flow in the fourteenth embodiment.
FIG. 40 is a diagram showing a passage range setting process flow in the fifteenth embodiment.
FIG. 41 is a diagram illustrating a hardware configuration of the passage extraction device.

Embodiment 1
When the passage extraction device is a server connected to client terminals via a network such as the Internet or an intranet, it is configured to retrieve documents and extract passages either with a search engine that it holds itself or with another search server reached via the network. The user transmits a statement of interest from the client terminal to the passage extraction device and receives passages as the extraction result.

When the passage extraction device is a user's client terminal, passages are extracted by searching for documents using a search server via a network such as the Internet or an intranet. The client terminal accepts the statement of interest from a character input device and displays the extracted passages on the screen.

FIG. 1 is a diagram showing an overall processing flow. The operation of the passage extraction device will be described. In the focused statement input process (S101), a focused statement indicating the content to be judged true / false (affirmative and negative) is input. In the conflict statement specifying process (S102), the conflict statement of the content that conflicts with the content of the statement of interest is specified.

In the statement-related document search process (S103), statement-related documents are searched for using the statement of interest and the conflict statement as conditions. Statement-related documents are classified into a net focus search document set, a pure conflict search document set, and a duplicate search document set.

In the document frequency calculation process (S104), a document frequency that is the number of documents in which a predetermined word appears in the document set is calculated. Specifically, the net focus search document frequency for the net focus search document set, the net conflict search document frequency for the net conflict search document set, and the duplicate search document frequency for the duplicate search document set are calculated.

In the word score calculation process (S105), a word score indicating a positive characteristic and a negative characteristic for the statement of interest is calculated. The word score may be a positive or negative score that indicates one characteristic in a one-dimensional manner with both characteristics as opposite characteristics, or a positive score and a negative score that indicate both characteristics independently.

In the passage range setting process (S106), the passage range is determined. A passage is a run of consecutive sentences forming part of a search document. A passage may have a fixed size or an arbitrary size.

In the passage score calculation process (S107), a passage score indicating the degree of juxtaposition of both positive and negative characteristics with respect to the statement of interest is calculated. That is, the degree to which the passage satisfies the condition of having both a positive characteristic and a negative characteristic is quantified. In addition, the degree to which the passage matches the proposition or theme (topic) that is the premise of the affirmation and denial may be evaluated together. Besides using the word score directly, the passage score calculation may use the word score indirectly, for example by judging according to the way feature words determined from the word score appear, or by using a sentence score.

In the passage selection process (S108), passages are selected based on the passage score. Passages with higher values, that is, top-ranked scores, are selected. In the passage output process (S109), the selected passages are output.
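The selection step (S108) can be pictured with a minimal sketch. It assumes the passage scores are already available as a mapping from passage text to score; the function name `select_passages` and the cut-off parameter `top_n` are hypothetical, since the text only says that passages with higher scores are selected.

```python
def select_passages(passage_scores, top_n=3):
    """Return the top_n (passage, score) pairs, highest score first.

    passage_scores: dict mapping passage text -> passage score (S107 output).
    top_n: hypothetical cut-off; the source does not specify how many
    passages are selected.
    """
    ranked = sorted(passage_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    scores = {"passage A": 4.0, "passage B": 7.5, "passage C": 1.0}
    for passage, score in select_passages(scores, top_n=2):
        print(passage, score)
```

A threshold on the score could be used in place of a fixed count; either choice is an assumption here.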

First, the focus statement input process (S101) and the conflict statement specifying process (S102) will be described. FIG. 2 is a diagram illustrating a configuration relating to focus statement input and conflict statement specification in the passage extraction device. The passage extraction device includes a focus statement input unit 201, a focus statement storage unit 202, a conflict statement specifying unit 203, and a conflict statement storage unit 204.

The focus statement input unit 201 inputs a focus statement and stores it in the focus statement storage unit 202. A statement of interest is a natural sentence or a phrase indicating a matter whose truth is to be judged. Although it is mainly a natural sentence such as “diesel cars are good for the environment”, the method also works for phrases such as “diesel cars that are good for the environment”. For example, an operator inputs the statement via a character input device. Alternatively, the statement is received from a client terminal via a network.

The conflict statement specifying unit 203 reads the focus statement from the focus statement storage unit 202 and generates conflict statements. FIG. 3 is a diagram showing a conflict statement specifying process flow. Content words included in the statement of interest are identified (S301). A content word is a word that carries general meaning, as opposed to a function word, which plays a grammatical role. In this example, adjectives, verbs, nouns, and sahen nouns (verbal nouns) are targeted. Then, for each content word (S302), it is determined whether an antonym exists (S303). If an antonym exists, the content word in the statement of interest is replaced with the antonym to produce a conflict statement (S304). Antonyms are acquired from an antonym dictionary database. The process ends when all content words have been processed (S305). That is, as many conflict statements as there are content words having an antonym are generated and stored in the conflict statement storage unit 204. When the statement of interest is a natural sentence, the conflict statement is also a natural sentence, and when the statement of interest is a phrase, the conflict statement is also a phrase. For example, the conflict statement “diesel cars are bad for the environment” is generated for the statement of interest “diesel cars are good for the environment”, and the conflict statement “diesel cars that are bad for the environment” is generated for the statement of interest “diesel cars that are good for the environment”.
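The antonym-substitution flow of FIG. 3 (S301 to S305) might be sketched as follows. This is a simplified illustration under stated assumptions: the source works on Japanese text with part-of-speech analysis and an antonym dictionary database, whereas this sketch splits English text on whitespace, uses a hypothetical function-word list in place of part-of-speech tagging, and uses a tiny hypothetical in-memory antonym dictionary.

```python
# Hypothetical stand-in for the antonym dictionary database.
ANTONYMS = {"good": "bad", "bad": "good"}

# Hypothetical function-word list standing in for part-of-speech analysis.
FUNCTION_WORDS = {"a", "an", "the", "is", "are", "for", "that", "of", "to"}


def content_words(statement):
    """S301: return (position, word) pairs for words treated as content words."""
    tokens = statement.split()
    return [(i, w) for i, w in enumerate(tokens) if w.lower() not in FUNCTION_WORDS]


def conflict_statements(statement):
    """S302-S305: build one conflict statement per content word with an antonym."""
    tokens = statement.split()
    results = []
    for i, word in content_words(statement):        # S302: each content word
        antonym = ANTONYMS.get(word.lower())         # S303: antonym lookup
        if antonym is None:
            continue
        replaced = tokens[:i] + [antonym] + tokens[i + 1:]  # S304: substitute
        results.append(" ".join(replaced))
    return results                                   # S305: all words processed


if __name__ == "__main__":
    print(conflict_statements("diesel cars are good for the environment"))
```

Because one statement is produced per replaceable content word, a statement of interest with several content words that have antonyms yields several conflict statements, as the text describes.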

In this example, a conflict statement was generated from the statement of interest using an antonym, but there is also a method of converting it into a negative form grammatically. For example, a positive sentence is converted into a negative sentence. Alternatively, there is a method of inputting a conflict statement together with the input of the statement of interest. That is, the statement of interest and the conflict statement are accepted as a pair. The method of accepting a conflict statement has the advantage that an appropriate conflict statement can be identified in line with the proposition or topic that the user is aware of.

Subsequently, the statement-related document search process (S103) will be described. FIG. 4 is a diagram showing a configuration relating to a statement-related document search in the passage extraction device. The passage extraction device includes a statement-related document search unit 401, a net focus search document storage unit 402, a pure conflict search document storage unit 403, and a duplicate search document storage unit 404, in addition to the focus statement storage unit 202 and the conflict statement storage unit 204.

FIG. 5 is a diagram showing a statement-related document search processing flow. First, a document is searched on the condition of the statement of interest (S501), and a retrieval result of the statement of interest is obtained. The search target is the Web or a document database. When targeting the Web, the URL of the Web document (example of search document identification information) and the Web document data are acquired. When a document database is targeted, a document ID (an example of search document identification information) and document data are acquired.

Further, the document is searched on the condition of the conflict statement (S502), and the search result of the conflict statement is obtained. The search target is similarly the Web or a document database. When targeting the Web, the URL of the Web document (example of search document identification information) and the Web document data are acquired. When a document database is targeted, a document ID (an example of search document identification information) and document data are acquired.

When the search engine is provided inside the passage extraction device, the search condition is passed according to the internal interface and the search result is received. When an external search engine is used, the search condition is transmitted via communication such as the Internet or an intranet, and the search result is received.

In this example, it is assumed that the search engine accepts a natural sentence or a phrase as a search condition, but a search engine using a logical expression based on a word as a search condition can also be used. In that case, in the statement-related document search process (S103), a content word is specified from a natural sentence or a phrase and, for example, the content word is connected with an AND condition to generate a logical expression.
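Building a logical expression from content words could look like the sketch below. The `AND` keyword syntax and the function-word list used to pick content words are assumptions made for illustration; an actual search engine's query syntax may differ, and the source identifies content words by part of speech.

```python
# Hypothetical function-word list standing in for part-of-speech analysis.
FUNCTION_WORDS = {"a", "an", "the", "is", "are", "for", "that", "of", "to"}


def build_and_query(statement):
    """Join the distinct content words of a statement with AND.

    The " AND " operator is an assumed query syntax, not one named
    in the source.
    """
    words = []
    for token in statement.lower().split():
        if token not in FUNCTION_WORDS and token not in words:
            words.append(token)
    return " AND ".join(words)


if __name__ == "__main__":
    print(build_and_query("diesel cars are good for the environment"))
```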

The documents included in the search results are then classified. A document included in the focus statement search result and not included in the conflict statement search result is associated with a document ID as a net focus search document and stored in the net focus search document storage unit 402 (S503). A document included in the conflict statement search result and not included in the focus statement search result is associated with a document ID as a pure conflict search document and stored in the pure conflict search document storage unit 403 (S504). A document included in both the focus statement search result and the conflict statement search result is associated with a document ID as a duplicate search document and stored in the duplicate search document storage unit 404 (S505). The document ID from the search result may be used as the document ID, or document IDs may be newly assigned.
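The classification in S503 to S505 amounts to set difference and intersection over document identifiers. A minimal sketch, assuming each search result is held as a dictionary from document identifier (a URL or document ID) to document text; the function name and the data layout are hypothetical:

```python
def classify_search_results(focus_results, conflict_results):
    """Split two search results into net focus, pure conflict, and duplicate sets.

    focus_results / conflict_results: dict mapping document ID -> document text.
    """
    focus_ids = set(focus_results)
    conflict_ids = set(conflict_results)

    # S503: retrieved with the statement of interest only.
    net_focus = {d: focus_results[d] for d in focus_ids - conflict_ids}
    # S504: retrieved with the conflict statement only.
    pure_conflict = {d: conflict_results[d] for d in conflict_ids - focus_ids}
    # S505: retrieved with both statements.
    duplicate = {d: focus_results[d] for d in focus_ids & conflict_ids}
    return net_focus, pure_conflict, duplicate


if __name__ == "__main__":
    focus = {"doc1": "low CO2", "doc2": "CO2 and NOx"}
    conflict = {"doc2": "CO2 and NOx", "doc3": "high NOx"}
    net_focus, pure_conflict, duplicate = classify_search_results(focus, conflict)
    print(sorted(net_focus), sorted(pure_conflict), sorted(duplicate))
```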

Subsequently, the document frequency calculation process (S104) will be described. FIG. 6 is a diagram showing a configuration relating to document frequency calculation in the passage extraction device. The passage extraction apparatus includes a document frequency calculation unit 601 and a word table 602 in addition to a pure focus search document storage unit 402, a pure conflict search document storage unit 403, and a duplicate search document storage unit 404. The word table 602 has one record per word, in which the pure focus search document frequency, the pure conflict search document frequency, the duplicate search document frequency, and the total search document frequency are stored in association with the word.

FIG. 7 is a diagram showing a document frequency calculation processing flow. The pure focus search document frequency calculation process (S701), the pure conflict search document frequency calculation process (S702), and the duplicate search document frequency calculation process (S703) are performed in sequence. The pure focus search document frequency is the number of pure focus search documents in the pure focus search document set in which the target word appears. Similarly, the pure conflict search document frequency is the number of pure conflict search documents in the pure conflict search document set in which the target word appears, and the duplicate search document frequency is the number of duplicate search documents in the duplicate search document set in which the target word appears. Furthermore, the total search document frequency is the number of documents in which the target word appears in the entire set of pure focus search documents, pure conflict search documents, and duplicate search documents.

The pure focus search document frequency calculation process (S701) will be described. FIG. 8 is a diagram showing a pure focus search document frequency calculation processing flow. The following processing is repeated for each pure focus search document (S801). The words included in the pure focus search document are identified in sequence, and the following processing is repeated for each word (S802). At this time, a word that appears multiple times in the same document is processed only once; in other words, duplicates are excluded. If the word table 602 has no record for the word (S803), a record for the word is newly added (S804), and the word ID and the word are written to it. The initial value of each document frequency is 0. Then, 1 is added to the pure focus search document frequency and to the total search document frequency (S805). This operation is performed for all the words included in the pure focus search document (S806), and the process proceeds to the next pure focus search document. The process ends when all the pure focus search documents have been processed (S807).

In the pure conflict search document frequency calculation process (S702), the pure conflict search document frequency and the total search document frequency are similarly counted. FIG. 9 is a diagram showing a pure conflict search document frequency calculation processing flow. The following processing is repeated for each pure conflict search document (S901), and further, the processing is repeated for each word included in the pure conflict search document (S902). Then, 1 is added to the pure conflict search document frequency and the total search document frequency (S905). As above, the second and subsequent occurrences of a word in the same document are ignored.

In the duplicate search document frequency calculation process (S703), the duplicate search document frequency and the total search document frequency are similarly counted. FIG. 10 is a diagram showing a duplicate search document frequency calculation processing flow. The following process is repeated for each duplicate search document (S1001), and further, the process is repeated for each word included in the duplicate search document (S1002). Then, 1 is added to the duplicate search document frequency and the total search document frequency (S1005). Similarly, in this process, even if the same word appears several times in the document, it is counted as one occurrence.
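The three counting passes (S701 to S703) share one pattern: each document contributes at most 1 to a word's frequency. The sketch below illustrates this under the assumption that every document has already been split into words (for Japanese text this would require a morphological analyzer, which is outside the sketch). The function name and the dictionary keys are hypothetical stand-ins for the word table 602.

```python
from collections import defaultdict

def count_document_frequencies(pure_focus_docs, pure_conflict_docs, duplicate_docs):
    """Build a word table; each argument is a list of tokenized documents."""
    table = defaultdict(
        lambda: {"focus": 0, "conflict": 0, "duplicate": 0, "total": 0}
    )
    groups = (
        ("focus", pure_focus_docs),        # S701
        ("conflict", pure_conflict_docs),  # S702
        ("duplicate", duplicate_docs),     # S703
    )
    for key, docs in groups:
        for words in docs:
            # set() makes a word repeated within one document count only once
            for word in set(words):
                table[word][key] += 1
                table[word]["total"] += 1
    return dict(table)
```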

Subsequently, the word score calculation process (S105) will be described. FIG. 11 is a diagram illustrating a configuration relating to word score calculation in the passage extraction device. The passage extraction device includes a word score calculation unit 1101 and a word score table 1102 in addition to the word table 602.

In this example, the word score is “positive/negative score = pure focus search document frequency − pure conflict search document frequency”. This single word score expresses the two opposing characteristics, positive and negative, by its sign. Examples of other word scores will be described later.

FIG. 12 is a diagram showing a word score calculation processing flow. For each word (S1201), the pure focus search document frequency and the pure conflict search document frequency are acquired from the word table 602, and the pure conflict search document frequency is subtracted from the pure focus search document frequency to obtain a difference (S1202). The difference is stored as the positive/negative score (word score) in association with the word ID (S1203). This process is performed for all words (S1204).
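As a small illustration of S1201 to S1204, the difference can be computed per word as follows. The input layout (a mapping from word to a pair of frequencies) and the function name are assumptions made for the sketch.

```python
def positive_negative_scores(word_table):
    """word_table maps word -> (pure focus frequency, pure conflict frequency)."""
    return {
        word: focus_df - conflict_df  # S1202: difference of document frequencies
        for word, (focus_df, conflict_df) in word_table.items()
    }
```

A word that appears mainly in pure focus search documents receives a positive value, a word that appears mainly in pure conflict search documents receives a negative value, and a word that appears equally often in both receives 0.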

Subsequently, the passage range setting process (S106) will be described. FIG. 13 is a diagram illustrating a configuration relating to passage range setting in the passage extraction device. The passage extraction apparatus includes a passage range determination unit 1301 and a passage table 1302 in addition to a pure focus search document storage unit 402, a pure conflict search document storage unit 403, and a duplicate search document storage unit 404.

FIG. 14 is a diagram showing a passage range setting process flow. The following processing is repeated for each search document (S1401). A start sentence is selected one sentence at a time from the top of the document (S1402), and the longest run of consecutive sentences (a passage) that begins at the start sentence and fits within a predetermined size is specified (S1403). The document ID, the start sentence ID, and the end sentence ID are stored in association with the passage ID (S1404). The passage range is thus defined by the document ID, the start sentence ID, and the end sentence ID. When all the sentences have been processed, the process proceeds to the next search document (S1405), and ends when all the documents have been processed (S1406). The predetermined size may be a total number of characters, a number of lines each consisting of a predetermined number of characters, or a number of sentences. The passage range can also be set in character units instead of sentence units. The search documents for which passage ranges are set are not limited to both the documents retrieved with the focus statement and the documents retrieved with the conflict statement; only the documents retrieved with the focus statement, or only the documents retrieved with the conflict statement, can also be targeted.
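The range setting can be sketched as a sliding start position with a greedy extension. The sketch assumes the predetermined size is a total character count and that a start sentence that alone exceeds the limit yields no passage; the text does not specify the latter, so it is an assumption of the sketch. Sentence indices stand in for sentence IDs.

```python
def set_passage_ranges(documents, max_chars):
    """documents maps document ID -> list of sentences (strings).

    Returns a list of (document ID, start index, end index), both indices
    inclusive. The position in the list plays the role of the passage ID.
    """
    passages = []
    for doc_id, sentences in documents.items():      # S1401
        for start in range(len(sentences)):          # S1402
            total = 0
            end = None
            for i in range(start, len(sentences)):   # S1403: extend while it fits
                total += len(sentences[i])
                if total > max_chars:
                    break
                end = i
            # Assumption: an oversized start sentence produces no passage
            if end is not None:
                passages.append((doc_id, start, end))  # S1404
    return passages
```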

Subsequently, the passage score calculation process (S107) will be described. FIG. 15 is a diagram illustrating a configuration relating to passage score calculation in the passage extraction device. The passage extraction device includes a passage score calculation unit 1301 and a passage table 1302 in addition to a pure focus search document storage unit 402, a pure conflict search document storage unit 403, a duplicate search document storage unit 404, a word table 602, and a word score table 1102.

FIG. 16 is a diagram showing a passage score calculation processing flow. The following processing is repeated for each passage that has been set (S1601). In accordance with the passage range (document ID, start sentence ID, end sentence ID) set in the passage table 1302, the passage is read from the corresponding search document storage unit (S1602). The words contained in the passage are identified, the positive/negative score of each word is read from the word score table 1102, and the scores are compared to determine the maximum positive/negative score (S1603). The maximum positive/negative score is set as the most positive score (S1604). The magnitude of the most positive score indicates how strongly positive the most positive word in the passage is. Similarly, the minimum positive/negative score among the words included in the passage is determined (S1605), and the absolute value of the minimum positive/negative score is set as the most negative score (S1606). The magnitude of the most negative score indicates how strongly negative the most negative word in the passage is. The most positive score is multiplied by the most negative score, and the product is stored as the passage score (S1607). The process ends when all passages have been processed (S1608). Alternatively, the most negative score may be added to the most positive score and the sum used as the passage score. It is also effective to invalidate the passage score when the most positive score and the most negative score do not satisfy a minimum value condition.
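A minimal sketch of S1603 to S1607 follows, assuming the passage score combines the most positive score with the most negative score. The parameters `min_value` and `combine`, and the use of `None` to mark an invalidated score, are illustrative choices that do not come from the text.

```python
def passage_score(words, word_scores, min_value=None, combine="product"):
    """Score a passage from the positive/negative scores of its words."""
    scores = [word_scores[w] for w in words if w in word_scores]
    if not scores:
        return None
    most_positive = max(scores)       # S1603, S1604
    # S1605, S1606: taken literally as |minimum|, even if the minimum is positive
    most_negative = abs(min(scores))
    if min_value is not None and (
        most_positive < min_value or most_negative < min_value
    ):
        return None                   # minimum value condition not satisfied
    if combine == "sum":
        return most_positive + most_negative
    return most_positive * most_negative  # S1607
```

With the product, a passage scores high only when it contains both a strongly positive word and a strongly negative word, which matches the aim of finding text that covers both situations.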

Finally, the passage selection process (S108) and the passage output process (S109) will be described. FIG. 17 is a diagram illustrating a configuration relating to passage selection and passage output in the passage extraction device. The passage extraction apparatus includes a passage selection unit 1701 and a passage output unit 1702 in addition to a pure focus search document storage unit 402, a pure conflict search document storage unit 403, a duplicate search document storage unit 404, and the passage table 1302.

The passage selection unit 1701 reads the passage scores from the passage table 1302 and identifies the maximum passage score. It then reads the passage range (document ID, start sentence ID, end sentence ID) associated with that passage score. The passage output unit 1702 reads the passage in that passage range from the search document storage unit and outputs it. Assumed output forms include display, printing, transmission, and storage in a storage medium. When a plurality of passages are to be output, the required number of passages are specified and output in descending order of passage score.
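Selection by descending passage score can be sketched as a sort followed by truncation. The record layout (dictionaries with `doc_id`, `start`, `end`, and `score` keys) is a hypothetical stand-in for the passage table 1302, and skipping records whose score is `None` is an assumption about how invalidated scores would be represented.

```python
def select_passages(passage_table, top_n=1):
    """Return the top_n passage records in descending order of score."""
    valid = [p for p in passage_table if p["score"] is not None]
    return sorted(valid, key=lambda p: p["score"], reverse=True)[:top_n]
```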

Embodiment 2.
In the above-described example, both the positive characteristic and the negative characteristic are indicated by a single positive/negative score. However, a positive score indicating the positive characteristic and a negative score indicating the negative characteristic can also be provided separately as word scores. In this example, there are two word scores: “positive score = pure focus search document frequency − pure conflict search document frequency” and “negative score = pure conflict search document frequency − pure focus search document frequency”.

FIG. 18 is a diagram illustrating a configuration relating to word score calculation in the passage extraction apparatus according to the second embodiment. In this example, the word score calculation unit 1101 stores a positive score and a negative score in the word score table 1102 for each word.

The word score calculation process (S105) in this embodiment will be described. FIG. 19 is a diagram showing a word score calculation processing flow in the second embodiment. In this example, the pure conflict search document frequency is subtracted from the pure focus search document frequency to obtain a document frequency difference (S1902), and this difference is stored in the word score table 1102 as a positive score (word score) in association with the word ID (S1903). Further, the pure focus search document frequency is subtracted from the pure conflict search document frequency to obtain another document frequency difference (S1904), and this other difference is stored in the word score table 1102 as a negative score (word score) in association with the word ID (S1905).

The passage score calculation process (S107) in this embodiment will be described. FIG. 20 is a diagram showing a passage score calculation processing flow in the second embodiment. Among the words included in the passage, the maximum positive score is determined (S2003) and set as the most positive score (S2004). The maximum negative score among the words included in the passage is also determined and is used as is as the most negative score, from which the passage score is obtained (S2006, S2007).

Embodiment 3.
In the first embodiment, the difference between the search document frequencies of each word is used as the word score. However, it is also effective to calculate search document frequency ratios by dividing each search document frequency of a word by the corresponding number of documents, and to use the difference between the ratios as the word score. In this example, the word score is “positive/negative score = (pure focus search document frequency / number of pure focus search documents) − (pure conflict search document frequency / number of pure conflict search documents)”.

FIG. 21 is a diagram showing a word score calculation processing flow in the third embodiment. For each word (S2101), the pure focus search document frequency is divided by the number of pure focus search documents to obtain the pure focus search document frequency ratio (S2102), and the pure conflict search document frequency is divided by the number of pure conflict search documents to obtain the pure conflict search document frequency ratio (S2103). Then, the pure conflict search document frequency ratio is subtracted from the pure focus search document frequency ratio to obtain the difference between the document frequency ratios (S2104), and this difference is stored as the positive/negative score (word score) in association with the word ID (S2105).

As a result, even when the number of pure focus search documents and the number of pure conflict search documents differ greatly, the contribution of each document to the score is made uniform between the two document sets.
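The effect of the normalization can be seen in a small worked example. The sketch below assumes both document sets are non-empty; the function name is illustrative.

```python
def ratio_difference_score(focus_df, conflict_df, n_focus_docs, n_conflict_docs):
    """Difference of document frequency ratios (S2102 to S2104).

    Assumes n_focus_docs > 0 and n_conflict_docs > 0.
    """
    focus_ratio = focus_df / n_focus_docs            # S2102
    conflict_ratio = conflict_df / n_conflict_docs   # S2103
    return focus_ratio - conflict_ratio              # S2104
```

Suppose there are 100 pure focus search documents and 10 pure conflict search documents, and a word appears in 20 of the former and 8 of the latter. The plain difference of the first embodiment is 20 − 8 = 12, which suggests a positive word, whereas the ratio difference is 0.2 − 0.8 = −0.6, which reflects that the word is far more typical of the smaller conflict set.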

Embodiment 4.
It is also conceivable to use the difference between the search document frequency ratios for each of the positive score and the negative score. In this example, the word scores are “positive score = (pure focus search document frequency / number of pure focus search documents) − (pure conflict search document frequency / number of pure conflict search documents)” and “negative score = (pure conflict search document frequency / number of pure conflict search documents) − (pure focus search document frequency / number of pure focus search documents)”.

FIG. 22 is a diagram showing a word score calculation processing flow in the fourth embodiment. For each word (S2201), as described above, the pure focus search document frequency is divided by the number of pure focus search documents to obtain the pure focus search document frequency ratio (S2202), and the pure conflict search document frequency is divided by the number of pure conflict search documents to obtain the pure conflict search document frequency ratio (S2203). Then, the pure conflict search document frequency ratio is subtracted from the pure focus search document frequency ratio to obtain the difference between the document frequency ratios (S2204), and this difference is stored as a positive score (word score) in association with the word ID (S2205). Further, the pure focus search document frequency ratio is subtracted from the pure conflict search document frequency ratio to obtain a separate difference between the document frequency ratios (S2206). The separately obtained difference is stored as a negative score (word score) in association with the word ID (S2207).

Embodiment 5.
In the second embodiment, the difference between document frequencies is used as the word score, but the ratio of document frequencies can also be used as the word score. In this example, the word score is “positive/negative score = pure focus search document frequency / pure conflict search document frequency”. This single word score expresses the two opposing characteristics, positive and negative, as the extremes of infinity and zero.

FIG. 23 is a diagram showing a word score calculation processing flow in the fifth embodiment. For each word (S2301), the pure focus search document frequency is divided by the pure conflict search document frequency to obtain a document frequency ratio (S2302), and the document frequency ratio is stored as the positive/negative score (word score) in association with the word ID (S2303). The process ends when all the words have been processed (S2304).

FIG. 24 is a diagram showing a passage score calculation processing flow in the fifth embodiment. Among the words included in the passage, the maximum positive/negative score is determined (S2403) and set as the most positive score (S2404). Furthermore, among the words included in the passage, the minimum positive/negative score is determined (S2405), and the reciprocal of the minimum positive/negative score is set as the most negative score (S2406). The most positive score is multiplied by the most negative score, and the product is stored as the passage score (S2407). Alternatively, the most negative score is added to the most positive score, and the sum is stored as the passage score (S2407). As before, a minimum value condition can be imposed.

Embodiment 6.
When a ratio is used as the index, the value becomes infinite when the denominator is 0. It is therefore effective to prevent the index from becoming infinite by adding a constant to the frequencies. In this example, the word score is “positive/negative score = (pure focus search document frequency + constant) / (pure conflict search document frequency + constant)”.

FIG. 25 is a diagram showing a word score calculation processing flow in the sixth embodiment. For each word, a constant is added to the pure focus search document frequency (S2502), a constant is added to the pure conflict search document frequency (S2503), and the pure focus search document frequency after the addition is divided by the pure conflict search document frequency after the addition to obtain a document frequency ratio (S2504). The document frequency ratio is stored as the positive/negative score (word score) in association with the word ID (S2505). The process ends when all words have been processed (S2506). For example, “1”, which is the minimum unit of frequency, is used as the constant.
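The smoothing of S2502 to S2504 can be written in one line. The sketch uses the constant 1 mentioned in the text as the default; the function name is illustrative.

```python
def smoothed_ratio_score(focus_df, conflict_df, constant=1):
    """Ratio of document frequencies with a constant added to both terms."""
    # With constant > 0 the denominator is never 0, so the score stays finite
    return (focus_df + constant) / (conflict_df + constant)
```

For a word that appears in 5 pure focus search documents and in no pure conflict search document, the unsmoothed ratio 5 / 0 is undefined, while the smoothed ratio is (5 + 1) / (0 + 1) = 6.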

The passage score calculation process of this embodiment is as shown in FIG.

Embodiment 7.
As in the sixth embodiment, in order to prevent the ratio-based index from becoming infinite, it is also conceivable to use two word scores: “positive score = pure focus search document frequency / (pure conflict search document frequency + constant)” and “negative score = pure conflict search document frequency / (pure focus search document frequency + constant)”.

FIG. 26 is a diagram showing a word score calculation processing flow in the seventh embodiment. A constant is added to the pure conflict search document frequency (S2602), and the pure focus search document frequency is divided by the pure conflict search document frequency after the addition to obtain a document frequency ratio (S2603). This document frequency ratio is stored as a positive score (word score) in association with the word ID (S2604). Further, a constant is added to the pure focus search document frequency (S2605), and the pure conflict search document frequency is divided by the pure focus search document frequency after the addition to obtain another document frequency ratio (S2606). This other document frequency ratio is stored as a negative score (word score) in association with the word ID (S2607).

Embodiment 8.
An example will be described in which, in calculating the word score, the score is multiplied by the total search document frequency in order to reflect the global importance of the word across all the search documents. In this example, the word scores are “positive score = (pure focus search document frequency * total search document frequency) / (pure conflict search document frequency + constant)” and “negative score = (pure conflict search document frequency * total search document frequency) / (pure focus search document frequency + constant)”.

FIG. 27 is a diagram showing a word score calculation processing flow in the eighth embodiment. A constant is added to the pure conflict search document frequency (S2702), the pure focus search document frequency is divided by the pure conflict search document frequency after the addition to obtain a document frequency ratio (S2703), and the document frequency ratio is multiplied by the total search document frequency (S2704). The resulting product is stored as a positive score (word score) in association with the word ID (S2705). Further, a constant is added to the pure focus search document frequency (S2706), the pure conflict search document frequency is divided by the pure focus search document frequency after the addition to obtain another document frequency ratio (S2707), and this other ratio is multiplied by the total search document frequency (S2708). The resulting product is stored as a negative score (word score) in association with the word ID (S2709).

It is also effective to multiply the above-mentioned other word scores (Embodiments 1 to 7) by the total search document frequency.
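The two weighted scores of the eighth embodiment can be sketched as follows. Omitting the multiplication by `total_df` gives the two scores of the seventh embodiment. The function name and the default constant of 1 are assumptions of the sketch.

```python
def weighted_two_sided_scores(focus_df, conflict_df, total_df, constant=1):
    """Return (positive score, negative score) weighted by total frequency."""
    # S2702 to S2704: smoothed ratio times total search document frequency
    positive = (focus_df * total_df) / (conflict_df + constant)
    # S2706 to S2708: the same with the roles of the two frequencies swapped
    negative = (conflict_df * total_df) / (focus_df + constant)
    return positive, negative
```

For a word with a pure focus frequency of 4, a pure conflict frequency of 1, and a total frequency of 6, the positive score is (4 * 6) / (1 + 1) = 12 and the negative score is (1 * 6) / (4 + 1) = 1.2.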

Embodiment 9.
In the above-described embodiment, the passage score is calculated based on the maximum value of the word score. However, it is also possible to define a word having a certain or higher word score as a feature word and to determine the passage score based on the number of the feature words.

FIG. 28 is a diagram showing an overall processing flow in the ninth embodiment. Following the word score calculation process (S105), a feature word determination process (S2801) is performed. In the feature word determination process (S2801), words having strong characteristics are classified as feature words based on the word score.

FIG. 29 is a diagram illustrating a configuration relating to feature word determination in the passage extraction device. The passage extraction device includes a feature word determination unit 2901, a positive feature word table 2902, a negative feature word table 2903, and a topic feature word table 2904, in addition to the notice statement storage unit 202, the word table 602, and the word score table 1102. A positive feature word is a word or phrase that represents a topic related only to the focus statement and not to the conflict statement, and a negative feature word is a word or phrase that represents a topic related only to the conflict statement and not to the focus statement. A topic feature word is a word or phrase that represents a topic common to the focus statement and the conflict statement. Each feature word table stores the set of the corresponding feature words.

The feature word determination when using a positive/negative score will be described. FIG. 30 is a diagram showing a feature word determination processing flow (No. 1). First, the content words of the focus statement are extracted (S3001). Subsequently, the following processing is repeated for each word (S3002). If the positive/negative score is larger than the positive threshold (S3003), the word is stored in the positive feature word table 2902 as a positive feature word (S3004). On the other hand, if the positive/negative score is smaller than the negative threshold (S3005), the word is stored in the negative feature word table 2903 as a negative feature word (S3006). For a word that falls under neither case, it is determined whether the word matches any of the content words (S3007), and if it matches, the word is stored in the topic feature word table 2904 as a topic feature word (S3008). This processing is performed for all words (S3009).

The feature word determination when using the positive score and the negative score will be described. FIG. 31 is a diagram showing a feature word determination processing flow (No. 2). When the positive score is larger than the positive threshold (S3103), the word is stored in the positive feature word table 2902 as a positive feature word (S3104). On the other hand, when the negative score is larger than the negative threshold (S3105), the word is stored in the negative feature word table 2903 as a negative feature word (S3106).
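The threshold test for the single positive/negative score (S3003 to S3008) can be sketched as below. The sketch assumes that the topic check is applied only to words that are neither positive nor negative feature words, and that the content words of the focus statement have already been extracted; both the function name and the argument layout are illustrative.

```python
def determine_feature_words(word_scores, content_words, pos_threshold, neg_threshold):
    """Classify words into positive, negative, and topic feature word sets."""
    positive, negative, topic = set(), set(), set()
    for word, score in word_scores.items():   # S3002
        if score > pos_threshold:             # S3003, S3004
            positive.add(word)
        elif score < neg_threshold:           # S3005, S3006
            negative.add(word)
        elif word in content_words:           # S3007, S3008
            topic.add(word)
    return positive, negative, topic
```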

FIG. 32 is a diagram showing a passage score calculation processing flow in the ninth embodiment. It is determined whether each positive feature word is included in the passage, and the number of positive feature words that appear is obtained (S3203). It is determined whether each negative feature word is included in the passage, and the number of negative feature words that appear is obtained (S3204). It is determined whether each topic feature word is included in the passage, and the number of topic feature words that appear is obtained (S3205). Then, the number of positive feature words that appear, the number of negative feature words that appear, and the number of topic feature words that appear are multiplied together, and the product is taken as the passage score (S3206). Alternatively, the three numbers are added, and the sum is taken as the passage score (S3206). In addition, the number of topic feature words that appear may be added to the product of the number of positive feature words that appear and the number of negative feature words that appear, with the sum taken as the passage score, or the sum of the number of positive feature words that appear and the number of negative feature words that appear may be multiplied by the number of topic feature words that appear, with the product taken as the passage score. Without using the number of topic feature words, the passage score can also be the product, or the sum, of the number of positive feature words that appear and the number of negative feature words that appear. Here, the number of feature words that appear means the number of distinct feature words, that is, the number of types.
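The counting in S3203 to S3205 and the combinations in S3206 can be sketched as follows, counting each feature word at most once per passage. The `mode` names are hypothetical labels for the variants described in the text.

```python
def feature_count_passage_score(passage_words, positive, negative, topic,
                                mode="product"):
    """Score a passage by how many distinct feature words of each kind appear.

    positive, negative, and topic are sets of feature words.
    """
    present = set(passage_words)
    n_pos = len(positive & present)    # S3203
    n_neg = len(negative & present)    # S3204
    n_topic = len(topic & present)     # S3205
    if mode == "product":
        return n_pos * n_neg * n_topic
    if mode == "sum":
        return n_pos + n_neg + n_topic
    if mode == "product_plus_topic":
        return n_pos * n_neg + n_topic
    if mode == "sum_times_topic":
        return (n_pos + n_neg) * n_topic
    raise ValueError(f"unknown mode: {mode}")
```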

Embodiment 10.
In this embodiment, for each feature word, the number of sentences including the feature word among the sentences included in the passage is calculated and used as the appearance frequency of the feature word. A passage score is set using the frequency of appearance of the feature words.

FIG. 33 is a diagram showing a passage score calculation process flow according to the tenth embodiment. For each positive feature word, the number of sentences in the passage that include the word is calculated and used as the appearance frequency of that positive feature word (S3303). For each negative feature word, the number of sentences in the passage that include the word is calculated and used as the appearance frequency of that negative feature word (S3304). For each topic feature word, the number of sentences in the passage that include the word is calculated and used as the appearance frequency of that topic feature word (S3305). Then, the total appearance frequency of all positive feature words, the total appearance frequency of all negative feature words, and the total appearance frequency of all topic feature words are multiplied together, and the product is taken as the passage score (S3306). Alternatively, the three totals are added, and the sum is taken as the passage score (S3306). Other conceivable methods include adding the total appearance frequency of all topic feature words to the product of the total appearance frequency of all positive feature words and the total appearance frequency of all negative feature words and using the sum as the passage score, and multiplying the sum of the total appearance frequency of all positive feature words and the total appearance frequency of all negative feature words by the total appearance frequency of all topic feature words and using the product as the passage score. Without using the total appearance frequency of all topic feature words, the passage score can also be the product, or the sum, of the total appearance frequency of all positive feature words and the total appearance frequency of all negative feature words.
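The difference from the ninth embodiment is that a feature word is counted once per sentence in which it occurs rather than once per passage. A sketch of the product variant, assuming each sentence of the passage is already tokenized; the function names are illustrative.

```python
def total_appearance_frequency(passage_sentences, feature_words):
    """Sum, over the feature words, of the number of sentences containing each."""
    sentence_sets = [set(s) for s in passage_sentences]
    return sum(
        sum(1 for s in sentence_sets if word in s)
        for word in feature_words
    )

def frequency_passage_score(passage_sentences, positive, negative, topic):
    """Product of the three total appearance frequencies (S3303 to S3306)."""
    return (
        total_appearance_frequency(passage_sentences, positive)
        * total_appearance_frequency(passage_sentences, negative)
        * total_appearance_frequency(passage_sentences, topic)
    )
```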

Embodiment 11.
It is also possible to assign a rank based on the word score and determine the feature word using the rank.

FIG. 34 is a diagram showing a feature word determination processing flow in the eleventh embodiment. Each word is ranked by its positive score (S3402), and each word is also ranked by its negative score. Then, for each word (S3404), if the difference obtained by subtracting the positive score rank from the negative score rank is larger than a rank difference threshold (S3405), the word is stored as a positive feature word (S3406). On the other hand, if the difference obtained by subtracting the negative score rank from the positive score rank is larger than the rank difference threshold (S3407), the word is stored as a negative feature word (S3408). Alternatively, regardless of the rank difference, a word can be treated as a positive feature word when its positive score rank is smaller than a rank threshold, and as a negative feature word when its negative score rank is smaller than the rank threshold.
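The rank comparison can be sketched as follows, assuming rank 1 denotes the highest score, that both score tables cover the same words, and that ties are broken by sort order (the text does not say how ties are handled). The function names are illustrative.

```python
def rank_words(scores):
    """Map each word to its rank; rank 1 is the highest score."""
    ordered = sorted(scores, key=lambda w: scores[w], reverse=True)
    return {word: i + 1 for i, word in enumerate(ordered)}

def rank_based_feature_words(positive_scores, negative_scores, rank_diff_threshold):
    """Classify words by the gap between their two ranks (S3402 to S3408)."""
    pos_rank = rank_words(positive_scores)   # S3402
    neg_rank = rank_words(negative_scores)
    positive, negative = set(), set()
    for word in positive_scores:             # S3404
        if neg_rank[word] - pos_rank[word] > rank_diff_threshold:    # S3405
            positive.add(word)               # S3406
        elif pos_rank[word] - neg_rank[word] > rank_diff_threshold:  # S3407
            negative.add(word)               # S3408
    return positive, negative
```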

Embodiment 12.
In the present embodiment, an example will be described in which a sentence score is calculated based on the number of appearances of feature words for a sentence, and a passage score is obtained based on the sentence score.

FIG. 35 is a diagram showing an overall processing flow in the twelfth embodiment. Subsequent to the feature word determination process (S2801), a sentence score calculation process (S3501) is performed.

The sentence score calculation process (S3501) will be described. FIG. 36 is a diagram showing a sentence score calculation processing flow. The following processing is repeated for each sentence included in each search document (S3601). For each feature word among the positive feature words, the negative feature words, and the topic feature words (S3602), it is determined whether the feature word is included in the sentence (S3603), and if it is included, 1 is added to the number of feature word appearances (initial value 0) (S3604). By performing this processing for all feature words (S3605), the number of feature word appearances is obtained, and this number is stored as the score of the sentence in the sentence score storage unit in association with the sentence ID (S3606). The process ends when the number of feature word appearances has been obtained for all sentences (S3607).

The passage score calculation process (S107) will be described. FIG. 37 is a diagram showing a passage score calculation process flow according to the twelfth embodiment. For each passage (S3701), the maximum sentence score is determined from the scores of the sentences included in the passage (S3702), and the maximum sentence score is set as the passage score (S3703). The process ends when all passages have been processed (S3704).
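The two steps of this embodiment, scoring each sentence and then taking the maximum over a passage, can be sketched as below. The sketch assumes tokenized sentences, a single combined set of feature words, and sentence indices in place of sentence IDs; the function names are illustrative.

```python
def sentence_score(sentence_words, feature_words):
    """Number of feature words that appear in the sentence (S3602 to S3605)."""
    present = set(sentence_words)
    return sum(1 for word in feature_words if word in present)

def passage_score_from_sentences(sentence_scores, start, end):
    """Maximum sentence score in the inclusive range start..end (S3702, S3703)."""
    return max(sentence_scores[start:end + 1])
```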

Embodiment 13
A mode in which adjustment is performed to increase the score of a passage including a positive side feature word, a negative side feature word, and a topic feature word will be described.

FIG. 38 is a diagram showing a passage score calculation process flow according to the thirteenth embodiment. For each passage (S3801), the maximum sentence score among the sentences included in the passage is determined in the same manner as described above (S3802). Then, it is determined whether the passage includes a positive side feature word, a negative side feature word, and a topic feature word (S3803). When the passage includes at least one positive side feature word, at least one negative side feature word, and at least one topic feature word, the maximum sentence score is multiplied by a bonus coefficient, and the product is used as the passage score (S3804). The bonus coefficient is a value greater than 1. If at least one of the positive side feature word, the negative side feature word, and the topic feature word is not included, the maximum sentence score is set as the passage score without being multiplied by the bonus coefficient (S3805). This is processed for all passages, and the process ends (S3806).
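
A minimal Python sketch of the passage score of FIG. 38 follows. The bonus value 1.5 is an arbitrary assumption (the embodiment only requires a value greater than 1), and the function name and argument layout are hypothetical.

```python
def passage_score(sentence_scores, passage_words,
                  affirmative, negative, topic, bonus=1.5):
    """Maximum sentence score, multiplied by a bonus when all three kinds appear.

    bonus=1.5 is an assumed value; the text only states that it is greater than 1.
    """
    best = max(sentence_scores)  # maximum sentence score (S3802)
    words = set(passage_words)
    has_all = (bool(words & set(affirmative))
               and bool(words & set(negative))
               and bool(words & set(topic)))  # check of S3803
    return best * bonus if has_all else best


with_all = passage_score([1, 3, 2], ["vaccine", "safe", "risk"],
                         {"safe"}, {"risk"}, {"vaccine"})
without_negative = passage_score([1, 3, 2], ["vaccine", "safe"],
                                 {"safe"}, {"risk"}, {"vaccine"})
```

With the default bonus of 1.0 replaced by no multiplication at all, the same function reduces to the twelfth embodiment, in which the passage score is simply the maximum sentence score.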

Embodiment 14
In the sentence score calculation, a form in which the sentence score is adjusted by a combination of feature words appearing in the sentence will be described.

FIG. 39 is a diagram showing a sentence score calculation processing flow in the fourteenth embodiment. As above, for each sentence included in each search document (S3901), the number of feature words appearing in the sentence is calculated (S3902). If the sentence is an incomplete sentence or an abbreviated sentence (S3903), the feature word appearance count is multiplied by a penalty coefficient (S3904). The penalty coefficient is a value smaller than 1. When no topic feature word is included (S3905), or when only a topic feature word is included (S3905, S3906, S3907), the feature word appearance count is used as the score of the sentence without being multiplied by a bonus coefficient (S3911). When a topic feature word and a positive side feature word are included (S3905, S3906) and no negative side feature word is included (S3908), the feature word appearance count is multiplied by a low bonus coefficient (S3909), and the product is used as the score of the sentence (S3911). The low bonus coefficient is a value greater than 1. When a topic feature word is included (S3905), no positive side feature word is included (S3906), and a negative side feature word is included (S3907), the count is likewise multiplied by the low bonus coefficient (S3909), and the product is used as the score of the sentence (S3911). When a topic feature word, a positive side feature word, and a negative side feature word are all included (S3905, S3906, S3908), the count is multiplied by a high bonus coefficient (S3910), and the product is used as the score of the sentence (S3911). The high bonus coefficient is larger than the low bonus coefficient. The process ends when all sentences have been processed (S3912).
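
The branching of FIG. 39 can be summarized in a short Python sketch. The coefficient values (0.5, 1.25, 1.5) are assumptions chosen only to satisfy the stated ordering, the `is_incomplete` flag stands in for whatever test identifies an incomplete or abbreviated sentence, and the sketch assumes that the penalty and a bonus may both apply to the same sentence.

```python
def adjusted_sentence_score(words, affirmative, negative, topic,
                            is_incomplete=False,
                            penalty=0.5, low_bonus=1.25, high_bonus=1.5):
    """Feature word count adjusted by a penalty and by combination bonuses.

    Assumed coefficient values: penalty < 1 < low_bonus < high_bonus.
    """
    present = set(words)
    features = set(affirmative) | set(negative) | set(topic)
    score = float(len(present & features))  # feature word count (S3902)
    if is_incomplete:
        score *= penalty                    # penalty (S3904)
    has_topic = bool(present & set(topic))
    has_pos = bool(present & set(affirmative))
    has_neg = bool(present & set(negative))
    if has_topic and has_pos and has_neg:
        score *= high_bonus                 # all three kinds (S3910)
    elif has_topic and (has_pos or has_neg):
        score *= low_bonus                  # topic plus one side (S3909)
    return score                            # otherwise unchanged (S3911)


aff, neg_words, top = {"safe"}, {"risk"}, {"vaccine"}
both_sides = adjusted_sentence_score(["vaccine", "safe", "risk"], aff, neg_words, top)
one_side = adjusted_sentence_score(["vaccine", "safe"], aff, neg_words, top)
no_topic = adjusted_sentence_score(["safe", "risk"], aff, neg_words, top)
```

A sentence that names the topic and both sides therefore outranks a sentence with the same number of feature words that covers only one side.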

Embodiment 15
A mode of setting a significant passage range with a variable length based on the sentence score will be described.

FIG. 40 is a diagram showing a passage range setting process flow in the fifteenth embodiment. For each document (S4001), the sentence scores included in the document are smoothed (S4002). For example, the sentence score of each sentence in a predetermined range (window) before and after the target sentence is multiplied by a coefficient corresponding to its distance from the target sentence, and the products are added together to obtain the smoothed sentence score. In general, a high coefficient is used for a sentence close to the target sentence, and a low coefficient is used for a sentence far from it. The simplest approach is to use the average of the sentence scores in the window. Then, the maximum sentence score in the document is identified, a predetermined ratio of the maximum sentence score (for example, 1/N, where N > 1) is taken as a reference, and each series of consecutive sentences whose sentence scores are equal to or higher than the reference is specified as a passage (S4003). Then, the document ID and the start sentence ID and end sentence ID of the series are stored in association with a passage ID (S4004). This is performed for all documents (S4005).
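
The smoothing and range setting of FIG. 40 can be sketched in Python using the simplest variant, the window average. The window size, the value of N, the use of list indices in place of sentence IDs, and the choice to take the maximum over the smoothed scores are assumptions for illustration.

```python
def smooth(scores, window=1):
    """Average each score with its neighbours within `window` sentences (S4002)."""
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - window)
        hi = min(len(scores), i + window + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def passage_ranges(scores, n=2.0, window=1):
    """Return (start_index, end_index) pairs of consecutive sentences whose
    smoothed score is at least 1/n of the maximum smoothed score (S4003)."""
    smoothed = smooth(scores, window)
    if not smoothed or max(smoothed) <= 0:
        return []  # assumed guard: no passage when no sentence scores above 0
    reference = max(smoothed) / n
    ranges, start = [], None
    for i, value in enumerate(smoothed):
        if value >= reference:
            if start is None:
                start = i                   # a run begins
        elif start is not None:
            ranges.append((start, i - 1))   # the run ended at the previous sentence
            start = None
    if start is not None:
        ranges.append((start, len(smoothed) - 1))
    return ranges


doc_scores = [0, 1, 5, 4, 0, 0, 3, 0]
narrow = passage_ranges(doc_scores, n=2.0)
wide = passage_ranges(doc_scores, n=4.0)
```

A larger N lowers the reference and so yields longer passages, which is how the passage length becomes variable.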

Embodiment 16
In the embodiments described above, for each word included in the pure focus search documents, which are the documents retrieved with the focus statement and not retrieved with the conflict statement, the pure focus search document frequency is obtained by counting the pure focus search documents that include the word. Likewise, for each word included in the pure conflict search documents, which are the documents retrieved with the conflict statement and not retrieved with the focus statement, the pure conflict search document frequency is obtained by counting the pure conflict search documents that include the word. Instead of or in addition to obtaining these document frequencies, it is also effective to obtain, for each word included in the pure focus search document set, the word frequency in the pure focus search document set by counting the number of times the word appears in that set, and to obtain, for each word included in the pure conflict search document set, the word frequency in the pure conflict search document set by counting the number of times the word appears in that set.

In this embodiment, instead of or in addition to the document frequency calculation (S104) by the document frequency calculation unit 601, an in-document-set word frequency calculation process is performed by an in-document-set word frequency calculation unit.

In the in-document-set word frequency calculation process, the word frequency calculation process for the pure focus search document set, the word frequency calculation process for the pure conflict search document set, and the word frequency calculation process for the duplicate search document set are performed in order. The word frequency in the pure focus search document set is the number of times (frequency) that the target word appears in the pure focus search document set. Similarly, the word frequency in the pure conflict search document set is the number of times that the target word appears in the pure conflict search document set, and the word frequency in the duplicate search document set is the number of times that the target word appears in the duplicate search document set. Furthermore, the word frequency in the entire search document set is the number of times that the target word appears in the entire set consisting of the pure focus search documents, the pure conflict search documents, and the duplicate search documents.

In the word frequency calculation process for the pure focus search document set, in S805 of FIG. 8, the number of times the word appears in the pure focus search document is calculated, and that number is added to the word frequency in the pure focus search document set and to the word frequency in the entire search document set. The rest is the same as in FIG. 8. The word frequency in the pure focus search document set and the word frequency in the entire search document set have an initial value of 0.

In the word frequency calculation process for the pure conflict search document set, in S905 of FIG. 9, the number of times the word appears in the pure conflict search document is calculated, and that number is added to the word frequency in the pure conflict search document set and to the word frequency in the entire search document set. The rest is the same as in FIG. 9. The word frequency in the pure conflict search document set and the word frequency in the entire search document set have an initial value of 0.

In the word frequency calculation process for the duplicate search document set, in S1005 of FIG. 10, the number of times the word appears in the duplicate search document is calculated, and that number is added to the word frequency in the duplicate search document set and to the word frequency in the entire search document set. The rest is the same as in FIG. 10. The word frequency in the duplicate search document set and the word frequency in the entire search document set have an initial value of 0.
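
The accumulation of word frequencies over the three document sets and over the entire search document set can be sketched in Python as follows. The sketch assumes that each document is already a list of words; the set labels and the function name are hypothetical.

```python
from collections import Counter


def set_word_frequencies(pure_focus_docs, pure_conflict_docs, duplicate_docs):
    """Count word occurrences per document set and over all search documents.

    Each argument is a list of documents; each document is a list of words.
    """
    freq = {"focus": Counter(), "conflict": Counter(),
            "duplicate": Counter(), "all": Counter()}
    groups = (("focus", pure_focus_docs),
              ("conflict", pure_conflict_docs),
              ("duplicate", duplicate_docs))
    for name, docs in groups:
        for doc in docs:
            counts = Counter(doc)       # occurrences of each word in this document
            freq[name].update(counts)   # add to the per-set word frequency
            freq["all"].update(counts)  # add to the entire-set word frequency
    return freq


frequencies = set_word_frequencies(
    [["safe", "safe", "vaccine"], ["safe"]],
    [["risk", "vaccine"]],
    [["vaccine", "vaccine"]],
)
```

Unlike the document frequency, which would count "safe" twice for the two pure focus search documents here, the word frequency in the set counts all three occurrences.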

In the word score calculation process (S105) in the word score calculation unit 1101, the word frequency in the pure focus search document set is used instead of or in combination with the pure focus search document frequency, the word frequency in the pure conflict search document set is used instead of or in combination with the pure conflict search document frequency, the word frequency in the duplicate search document set is used instead of or in combination with the duplicate search document frequency, and the word frequency in the entire search document set is used instead of or in combination with the entire search document frequency.

Besides calculating the word score by replacing each search document frequency with the corresponding word frequency in the search document set, it is also possible to obtain a first intermediate word score based on the search document frequencies and a second intermediate word score based on the word frequencies in the search document sets, and to calculate the final word score from these two intermediate scores. For example, the first intermediate word score and the second intermediate word score are added, and the sum is used as the final word score. In that case, the first intermediate word score and the second intermediate word score may also be weighted. Another method is to multiply the first intermediate word score by the second intermediate word score and use the product as the final word score.
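
The combinations mentioned above (sum, weighted sum, and product) can be written as one small Python function. The function name, the `mode` switch, and the default weights of 1.0 are assumptions for illustration; the embodiment does not prescribe particular weights.

```python
def combine_word_scores(first, second, mode="sum",
                        weight_first=1.0, weight_second=1.0):
    """Combine two intermediate word scores into a final word score.

    first:  score based on the search document frequencies
    second: score based on the word frequencies in the search document sets
    """
    if mode == "sum":
        # Plain sum with the default weights, weighted sum otherwise.
        return weight_first * first + weight_second * second
    if mode == "product":
        return first * second
    raise ValueError("mode must be 'sum' or 'product'")


plain_sum = combine_word_scores(0.5, 0.25)
weighted_sum = combine_word_scores(0.5, 0.25, weight_first=2.0)
product = combine_word_scores(0.5, 0.25, mode="product")
```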

The passage extraction device is a computer, and each element can execute processing by a program. Further, the program can be stored in a storage medium so that the computer can read the program from the storage medium.

The hardware configuration of the passage extraction device will be described. FIG. 41 is a diagram illustrating a hardware configuration of the passage extraction device. An arithmetic device 4101, a data storage device 4102, a memory 4103, a communication interface 4104, a data input device 4105, and a data output device 4106 are connected to a bus. The data storage device 4102 is, for example, a ROM (Read Only Memory) or a hard disk. The memory 4103 is usually a RAM (Random Access Memory). The program is usually stored in the data storage device 4102 and, while loaded in the memory 4103, is sequentially read into the arithmetic device 4101 and processed. The communication interface 4104 is used for communication via a network. The data input device 4105 is used for data input. The data output device 4106 is used for outputting data. Note that the program may be stored in a server on a network connected to the communication interface 4104 and loaded into the memory 4103 at the time of execution.

DESCRIPTION OF SYMBOLS
201 Focus statement input unit
202 Focus statement storage unit
203 Conflict statement specifying unit
204 Conflict statement storage unit
401 Statement-related document search unit
402 Pure focus search document storage unit
403 Pure conflict search document storage unit
404 Duplicate search document storage unit
601 Document frequency calculation unit
602 Word table
1101 Word score calculation unit
1102 Word score table
1301 Passage range determination unit
1302 Passage table
1501 Passage score calculation unit
1701 Passage selection unit
1702 Passage output unit
2901 Feature word determination unit
2902 Positive side feature word table
2903 Negative side feature word table
2904 Topic feature word table
4101 Arithmetic device
4102 Data storage device
4103 Memory
4104 Communication interface
4105 Data input device
4106 Data output device

Claims

A passage extraction apparatus that extracts, from search documents, a passage including affirmative content and negative content with respect to a statement of interest indicating a matter whose truth or falsity is to be judged, the apparatus having the following elements:
(1) a statement-of-interest input unit that inputs the statement of interest;
(2) a conflict statement specifying unit that specifies a conflict statement indicating content opposite to the statement of interest;
(3) a statement-related document search unit that searches for documents based on the statement of interest and searches for documents based on the conflict statement;
(4) a frequency calculation unit that, for each word included in the pure focus search documents, which are retrieved with the statement of interest and not retrieved with the conflict statement, obtains a pure focus search document frequency by counting the pure focus search documents including the word, and, for each word included in the pure conflict search documents, which are retrieved with the conflict statement and not retrieved with the statement of interest, obtains a pure conflict search document frequency by counting the pure conflict search documents including the word, and/or that, for each word included in the pure focus search document set, obtains the frequency of the word in the pure focus search document set by counting the number of times the word appears in that set, and, for each word included in the pure conflict search document set, obtains the frequency of the word in the pure conflict search document set by counting the number of times the word appears in that set;
(5) a word score calculation unit that, for each word, calculates a word score indicating the affirmative characteristic and the negative characteristic of the word with respect to the statement of interest, based on the pure focus search document frequency and the pure conflict search document frequency of the word and/or the frequency of the word in the pure focus search document set and the frequency of the word in the pure conflict search document set;
(6) a passage score calculation unit that, for each passage, calculates a passage score indicating the degree to which the passage presents both the affirmative characteristic and the negative characteristic with respect to the statement of interest side by side, based on the word scores of the words included in the passage; and
(7) a passage output unit that outputs passages based on the passage score.
The passage extraction device further has:
a feature word determination unit that determines, based on the word score, an affirmative feature word whose affirmative characteristic with respect to the statement of interest is higher than a predetermined criterion, and a negative feature word whose negative characteristic with respect to the statement of interest is higher than a predetermined criterion; and
a sentence score calculation unit that, for each sentence included in the documents searched with the statement of interest and/or the documents searched with the conflict statement, obtains a sentence score by counting the affirmative feature words and negative feature words included in the sentence,
The passage extraction apparatus according to claim 1, wherein the passage score calculation unit sets the maximum sentence score among the sentences included in the passage as the passage score.
The feature word determination unit determines a topic feature word that is a content word of a statement of interest that does not correspond to a positive feature word and a negative feature word,
The passage extraction device according to claim 2, wherein the sentence score calculation unit counts the number of topic feature words included in the sentence and adds it to the sentence score.
The passage score calculation unit is configured to multiply the maximum sentence score by a bonus coefficient larger than 1 to obtain a passage score when the passage includes a positive feature word, a negative feature word, and a topic feature word. The passage extraction device according to claim 3.
The passage extraction device according to claim 3, wherein the sentence score calculation unit multiplies the sentence score by a low bonus coefficient greater than 1 when the passage includes a topic feature word and either a positive feature word or a negative feature word, and multiplies the sentence score by a high bonus coefficient larger than the low bonus coefficient when the passage includes both a positive feature word and a negative feature word.
The passage extraction device according to claim 2, further comprising a passage range setting unit for setting a passage range based on a sentence score.
The passage extraction device further has
a feature word determination unit that determines, based on the word score, an affirmative feature word whose affirmative characteristic with respect to the statement of interest is higher than a predetermined criterion, and a negative feature word whose negative characteristic with respect to the statement of interest is higher than a predetermined criterion,
The passage extraction device according to claim 1, wherein the passage score calculation unit calculates a passage score based on the number of positive side feature words and the number of negative side feature words included in the passage.
The feature word determination unit determines a topic feature word that is not a positive feature word and a negative feature word but is a content word of the statement of interest,
The passage extraction device according to claim 7, wherein the passage score calculation unit further calculates a passage score based on the number of topic feature words included in the passage.
The passage extraction device further has
a feature word determination unit that determines, based on the word score, an affirmative feature word whose affirmative characteristic with respect to the statement of interest is higher than a predetermined criterion, and a negative feature word whose negative characteristic with respect to the statement of interest is higher than a predetermined criterion,
The passage extraction apparatus according to claim 1, wherein the passage score calculation unit obtains, for each affirmative feature word, the appearance frequency of the affirmative feature word by counting the sentences in the passage that include the affirmative feature word, obtains, for each negative feature word, the appearance frequency of the negative feature word by counting the sentences in the passage that include the negative feature word, and calculates a passage score based on the appearance frequency of the affirmative feature word and the appearance frequency of the negative feature word.
The feature word determination unit determines a topic feature word that is a content word of a statement of interest that does not correspond to a positive feature word and a negative feature word,
The passage score calculation unit further calculates, for each topic feature word, the number of sentences including the topic feature word among the sentences included in the passage, and obtains the appearance frequency of the topic feature word. The passage extraction apparatus according to claim 9, wherein a passage score is calculated based on the appearance frequency.
The passage extraction device according to claim 7 or 9, wherein the feature word determination unit obtains, for each word, a ranking of the affirmative characteristic with respect to the statement of interest and a ranking of the negative characteristic with respect to the statement of interest based on the word score, and determines the positive side feature word and the negative side feature word based on these two rankings.
A passage extraction method performed by a passage extraction device that extracts, from search documents, a passage including affirmative content and negative content with respect to a statement of interest indicating a matter whose truth or falsity is to be judged, the method having the following steps:
(1) a statement-of-interest input step of inputting the statement of interest;
(2) a conflict statement specifying step of specifying a conflict statement indicating content opposite to the statement of interest;
(3) a statement-related document search step of searching for documents based on the statement of interest and searching for documents based on the conflict statement;
(4) a frequency calculation step of, for each word included in the pure focus search documents, which are retrieved with the statement of interest and not retrieved with the conflict statement, obtaining a pure focus search document frequency by counting the pure focus search documents including the word, and, for each word included in the pure conflict search documents, which are retrieved with the conflict statement and not retrieved with the statement of interest, obtaining a pure conflict search document frequency by counting the pure conflict search documents including the word, and/or of, for each word included in the pure focus search document set, obtaining the frequency of the word in the pure focus search document set by counting the number of times the word appears in that set, and, for each word included in the pure conflict search document set, obtaining the frequency of the word in the pure conflict search document set by counting the number of times the word appears in that set;
(5) a word score calculation step of, for each word, calculating a word score indicating the affirmative characteristic and the negative characteristic of the word with respect to the statement of interest, based on the pure focus search document frequency and the pure conflict search document frequency of the word and/or the frequency of the word in the pure focus search document set and the frequency of the word in the pure conflict search document set;
(6) a passage score calculation step of, for each passage, calculating a passage score indicating the degree to which the passage presents both the affirmative characteristic and the negative characteristic with respect to the statement of interest side by side, based on the word scores of the words included in the passage; and
(7) a passage output step of outputting passages based on the passage score.
A program for causing a computer serving as a passage extraction device, which extracts from search documents a passage including affirmative content and negative content with respect to a statement of interest indicating a matter whose truth or falsity is to be judged, to execute the following procedures:
(1) a statement-of-interest input procedure of inputting the statement of interest;
(2) a conflict statement specifying procedure of specifying a conflict statement indicating content opposite to the statement of interest;
(3) a statement-related document search procedure of searching for documents based on the statement of interest and searching for documents based on the conflict statement;
(4) a frequency calculation procedure of, for each word included in the pure focus search documents, which are retrieved with the statement of interest and not retrieved with the conflict statement, obtaining a pure focus search document frequency by counting the pure focus search documents including the word, and, for each word included in the pure conflict search documents, which are retrieved with the conflict statement and not retrieved with the statement of interest, obtaining a pure conflict search document frequency by counting the pure conflict search documents including the word, and/or of, for each word included in the pure focus search document set, obtaining the frequency of the word in the pure focus search document set by counting the number of times the word appears in that set, and, for each word included in the pure conflict search document set, obtaining the frequency of the word in the pure conflict search document set by counting the number of times the word appears in that set;
(5) a word score calculation procedure of, for each word, calculating a word score indicating the affirmative characteristic and the negative characteristic of the word with respect to the statement of interest, based on the pure focus search document frequency and the pure conflict search document frequency of the word and/or the frequency of the word in the pure focus search document set and the frequency of the word in the pure conflict search document set;
(6) a passage score calculation procedure of, for each passage, calculating a passage score indicating the degree to which the passage presents both the affirmative characteristic and the negative characteristic with respect to the statement of interest side by side, based on the word scores of the words included in the passage; and
(7) a passage output procedure of outputting passages based on the passage score.