WO2008023470A1 - Sentence search method, sentence search engine, computer program, recording medium, and document storage - Google Patents

Sentence search method, sentence search engine, computer program, recording medium, and document storage

Info

Publication number
WO2008023470A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
sentence
sentence unit
words
weighted
Prior art date
Application number
PCT/JP2007/055448
Other languages
French (fr)
Japanese (ja)
Inventor
Shun Shiramatsu
Kazunori Komatani
Hiroshi Okuno
Original Assignee
Kyoto University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kyoto University filed Critical Kyoto University
Priority to JP2008530812A priority Critical patent/JP5167546B2/en
Publication of WO2008023470A1 publication Critical patent/WO2008023470A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Definitions

  • Sentence unit search method, sentence unit search device, computer program, recording medium, and document storage device
  • The present invention relates to a search method that stores a large number of document data and retrieves content from the document set based on words, such as text or speech, received from a user for searching.
  • More particularly, it relates to a method that can directly retrieve, from among sentence units that form groups of meaning in a document and whose meaning changes dynamically with the flow of context, those sentence units whose meaning is similar to the accepted words.
  • Specifically, the present invention relates to a sentence unit retrieval method, a sentence unit retrieval apparatus, a computer program that causes a computer to function as the sentence unit retrieval apparatus, a computer-readable recording medium on which the computer program is recorded, and a document storage apparatus.
  • Conventional document search services operate as follows. Documents published on the Internet are automatically collected and stored, and for each document the words appearing in it are stored together with their appearance probabilities. When words such as keywords or a sentence are accepted, documents are extracted from the stored document set with priorities assigned in descending order of the appearance probabilities of the words included in the accepted keywords or sentence, and the sentences or paragraphs containing those words are extracted from the retrieved documents and output.
  • A user of such a document search service must think up keywords related to the information he or she wants to find.
  • Suppose, for example, that the user wants to know about economic and international policies. Even if the user's input is in natural language, which of the words "America, President, other countries, economy, problems, outbreak, countermeasures" is most important is something a human can grasp when reading, but it is difficult to express quantitatively as information handled by a device or computer. Consequently, although all the keywords are included, a document describing "economic problems of America and countermeasures by presidents of other countries" may be output instead of what the user intended.
  • There are also cases where the keyword entered for the search is included in the document to be found, but does not appear frequently there even though it has an important meaning in context.
  • In particular, a topical word is often expressed with a demonstrative pronoun or a zero pronoun. A sentence or paragraph in which the keyword entered for the search appears only as a demonstrative pronoun or zero pronoun may therefore be exactly the information the user wants to obtain as a search result.
  • However, when search results are prioritized by actual appearance frequency, such passages have a low appearance frequency of the keyword input by the user, so they are excluded from the candidates during narrowing down and are not output as search results.
  • To address this, a technique has been proposed in which the words in a document are extracted and the document is annotated, by morphological analysis, with part-of-speech information for each word, dependency information between words, and information specifying anaphoric relationships with demonstrative pronouns or zero pronouns.
  • Based on the stored information, a device or computer retrieves documents, answers questions, and performs machine translation (see Non-Patent Document 1).
  • Non-Patent Document 1: Hiroshi Hashida, "Global Document Modification," Proceedings of the 11th Annual Conference of the Japanese Society for Artificial Intelligence, pp. 62-63 (1997)
  • However, the object of the user's attention in each sentence or each utterance changes dynamically according to the context or the flow of the discourse.
  • That is, the weight representing the degree of attention paid to words in conversations and texts changes dynamically. Therefore, in order to realize a service that retrieves information related to conversations and texts, it is necessary to track the dynamic changes in word weights according to the context.
  • The technique of Non-Patent Document 1 automatically analyzes information that can be identified grammatically, such as part-of-speech information, and can add to the document information on anaphora, coreference, and dependency for demonstrative pronouns or zero pronouns. With this added information, the noun being referred to can be counted toward appearance frequency, so the relationships between words in sentences or paragraphs can be analyzed. However, the degree of attention paid to words in each sentence or paragraph, i.e., their manifestation, cannot be measured quantitatively.
  • The technique of Non-Patent Document 1 can be applied to realizing question answering in which a computer responds to a question posed in a natural sentence, taking into account words omitted from the question.
  • However, it is not easy to calculate the contextual meaning of a conversation between multiple users as a quantitative value, nor to generate and present, as a third party, utterances suited to the users' conversational context.
  • The present invention has been made in view of such circumstances. For each sentence unit consisting of one or more sentences, a weighted word group, in which each word is assigned a weight value indicating its manifestation in that sentence unit, is stored in association with the sentence unit. The words accepted for search are likewise associated with a weighted word group whose weight values reflect manifestation in those words, and sentence units whose weighted word groups are similar are extracted and output.
  • It is an object of the present invention to provide a sentence unit search method, a sentence unit search apparatus, a computer program that causes a computer to function as the search apparatus, and a computer-readable recording medium on which the computer program is recorded, which automatically generate, from the received words, information reflecting the context of preceding words in the user's consciousness, and which can directly search, among sentence units in a document whose meaning changes dynamically with the flow of context, for sentence units whose contextual meaning is similar to that represented by the generated information.
  • A further object of the present invention is to give, as the weight value indicating the manifestation of each word in the weighted word group associated with a sentence unit or with received words, the probability that the word will appear or be referred to in subsequent sentence units or words.
  • A further object of the present invention is to quantitatively calculate degrees of association between related words and to reflect those degrees of association in the manifestation of each word in each sentence unit or word, so that sentence units which the user is reminded of when speaking or writing can be retrieved effectively even if the corresponding words do not appear in the words spoken or the text written.
  • An object of the present invention is to provide a sentence unit retrieval method and a document storage device capable of such retrieval. Means for solving the problem
  • The sentence unit retrieval method according to the present invention uses a document set in which a plurality of document data composed of natural language are stored, separates the document data obtained from the document set into sentence units each consisting of one or more sentences, accepts words, and retrieves the separated sentence units from the document set based on the accepted words.
  • In this method, the step of extracting similar sentence units includes a step of determining whether the distribution of weight values of the plurality of words in the weighted word group associated with each previously sorted sentence unit satisfies a predetermined condition relative to the distribution of weight values of the plurality of words in the weighted word group associated with the received words, and a step of extracting the sentence units associated with the weighted word groups determined to satisfy the predetermined condition.
  • Alternatively, the step of extracting similar sentence units includes a step of extracting, from the previously sorted sentence units, those whose weighted word group contains the same words as the weighted word group associated with the received words, a step of calculating the difference between the weight values assigned to those same words, and a step of assigning priorities to the extracted sentence units in ascending order of the calculated difference; the extracted sentence units are output based on the priorities.
  • In another aspect, the method includes a step of calculating each weighted word group as a multidimensional vector in which each word constitutes one dimension and the magnitude of the weight value assigned to each word is the element in the dimension corresponding to that word. The step of extracting similar sentence units then includes a step of calculating the distance between the multidimensional vector stored for each separated sentence unit and the multidimensional vector associated with the received words, and a step of assigning priorities to the sentence units in ascending order of the calculated distance; the sentence units are output according to the assigned priorities.
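  • As an illustration, the multidimensional vector representation and distance-based ranking just described can be sketched as follows; this is a minimal sketch in Python, not the patent's implementation, and the example words and weight values are invented.

```python
import math

# A weighted word group is modeled as a dict mapping word -> weight value
# (manifestation); each sentence unit and the accepted words get one vector.

def distance(a: dict[str, float], b: dict[str, float]) -> float:
    """Euclidean distance between two weighted word groups, treating each
    word as one dimension of a multidimensional vector."""
    words = set(a) | set(b)
    return math.sqrt(sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in words))

def rank_sentence_units(query: dict[str, float],
                        stored: list[tuple[str, dict[str, float]]]) -> list[str]:
    """Return stored sentence units in ascending order of distance, i.e.,
    highest priority (most similar contextual meaning) first."""
    return [unit for unit, _ in sorted(stored, key=lambda s: distance(query, s[1]))]

query = {"Kyoto": 0.8, "summer": 0.6, "festival": 0.7}
stored = [
    ("S_k: the Gion Festival unit", {"Kyoto": 0.7, "festival": 0.8, "July": 0.5}),
    ("S_1: a unit about Tokyo in winter", {"Tokyo": 0.9, "winter": 0.6}),
]
print(rank_sentence_units(query, stored))   # S_k ranks first
```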
  • In another aspect, the method includes a reference probability calculation step of calculating, for each word, a reference probability that the word will appear in, or be referred to from, sentence units or words subsequent to the sentence unit or words in question; the calculated reference probability is assigned as the weight value of each word.
  • The reference probability calculation involves: a specifying step of specifying, for each word, a feature pattern that includes the pattern of the word's appearances across a plurality of sentence units including preceding sentence units, or the pattern of its being referred to from preceding sentence units or words; a determination step of determining, for each word for which a feature pattern has been specified, whether a word with the same feature pattern appeared or was referred to in subsequent sentence units in the document data; and a regression step of calculating regression coefficients of the feature patterns with respect to the reference probability by performing regression analysis on the specified feature patterns and the determination results. When a weighted word group is stored in association with each sentence unit, or associated with accepted words, the reference probability calculation step specifies the feature pattern of each word in that sentence unit or those words and calculates the reference probability from the identified feature pattern using the regression coefficients.
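  • A minimal sketch of this regression step follows. The patent specifies only "regression analysis"; logistic regression, the scikit-learn API, and the tiny feature set and training data here are assumptions for illustration.

```python
# Learn, from feature patterns observed in a training document set, the
# probability that a word will appear or be referred to in the subsequent
# sentence unit.
from sklearn.linear_model import LogisticRegression

# Each row is one (word, sentence unit) observation: quantified features such
# as distance since last mention, mention count, was-subject flag, POS-is-noun.
X_train = [
    [1, 3, 1, 1],   # mentioned 1 unit ago, 3 mentions, was subject, noun
    [5, 1, 0, 1],   # mentioned 5 units ago, 1 mention, not subject, noun
    [2, 2, 1, 0],
    [8, 1, 0, 0],
]
# Label: did the word actually appear / get referred to in the next unit?
y_train = [1, 0, 1, 0]

model = LogisticRegression().fit(X_train, y_train)

# At search time: the same features, extracted for a word in the current
# sentence unit or utterance, yield its reference probability (weight value).
features_now = [[1, 4, 1, 1]]
reference_probability = model.predict_proba(features_now)[0][1]
print(f"weight value (reference probability): {reference_probability:.3f}")
```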
  • In another aspect, the reference probabilities for the acquired document data are calculated from a first document set composed of written language, and the reference probabilities for the accepted words are calculated from a second document set composed of spoken language.
  • That is, the sentence unit search method performs the specifying step, the determination step, and the regression step for each of the first document set made up of written language and the second document set made up of spoken language,
  • and in the reference probability calculation step, for the feature pattern of a word specified in a sentence unit, the reference probability is calculated using the regression coefficients obtained in the regression step executed on the first document set, while for the feature pattern of a word specified in the accepted words, the reference probability is calculated using the regression coefficients obtained in the regression step executed on the second document set.
  • The feature pattern is specified from information including one or more of the following: when the word is referred to from a preceding sentence unit or preceding words, the number of sentence units or words from that preceding sentence unit or words up to the sentence unit or words containing the word; the dependency information of the word in the last preceding sentence unit or words in which it appeared or was referred to; the number of times the word has appeared or been referred to up to the sentence unit or words containing it; the noun-type distinction of the word in the last preceding sentence unit or words in which it appeared or was referred to; whether the word was the subject in that last preceding sentence unit or words; the grammatical person of the word; and the part-of-speech information of the word in the sentence unit or words containing it.
  • When the words are accepted as speech, the feature pattern is specified from information including one or more of the following: the time elapsed from the preceding sentence unit or words in which the word appeared or was referred to; the utterance speed corresponding to the word in the last preceding sentence unit or words in which it appeared or was referred to; and the voice frequency corresponding to the word in that last preceding sentence unit or words.
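  • One way to quantify such a feature pattern is a simple record per word and sentence unit, as sketched below; every field name is illustrative, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class FeaturePattern:
    units_since_last_mention: int   # sentence units/words back to last appearance or reference
    mention_count: int              # times appeared or referred to so far
    was_subject: int                # 1 if the word was the subject at its last mention
    is_proper_noun: int             # noun-type distinction
    person: int                     # grammatical person (1, 2, 3; 0 if n/a)
    pos_is_noun: int                # part-of-speech information, collapsed to a flag
    # Speech-only features (0.0 for written text):
    seconds_since_last_mention: float
    utterance_speed: float          # e.g., phonemes per second for the word
    voice_frequency: float          # e.g., mean pitch in Hz

    def as_vector(self) -> list[float]:
        """Flatten to the explanatory-variable vector used by the regression."""
        return [self.units_since_last_mention, self.mention_count,
                self.was_subject, self.is_proper_noun, self.person,
                self.pos_is_noun, self.seconds_since_last_mention,
                self.utterance_speed, self.voice_frequency]
```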
  • The sentence unit search method further includes, for one word among the words extracted from the document set: a first step of extracting, from the weighted word groups associated with the sorted sentence units, the word groups that include the one word and in which the weight value of the one word is greater than or equal to a predetermined value; a second step of creating a related word group in which the value obtained by integrating, word by word, the weight values of the word groups extracted in the first step is assigned to each word as the degree of relevance of the one word to that word; and a third step of storing the created related word group in association with the one word. The first to third steps are executed for each of the extracted words.
  • The method then includes a relevance addition step of re-assigning the weight value of each word of the weighted word group associated with each sentence unit or with the accepted words, using the relevance degrees of the related word group stored in association with each word.
  • The second step includes: a step of calculating, for each word included in the extracted word groups, the sum of that word's weight values weighted by the weight value of the one word in each group; a step of averaging the calculated sums; and a step of assigning the averaged sum of weight values of each word as the relevance degree of that word in the related word group being created.
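  • A minimal sketch of the first and second steps follows; normalizing by the accumulated weight of the one word is one reading of the integrating and averaging steps, and the words and weights are invented.

```python
from collections import defaultdict

def related_word_group(target: str,
                       weighted_groups: list[dict[str, float]],
                       threshold: float = 0.5) -> dict[str, float]:
    """Create the related word group of `target`: average, over the weighted
    word groups where `target` weighs at least `threshold`, each co-occurring
    word's weight, weighted by `target`'s own weight in that group."""
    selected = [g for g in weighted_groups if g.get(target, 0.0) >= threshold]
    if not selected:
        return {}
    sums: dict[str, float] = defaultdict(float)
    total = 0.0
    for g in selected:
        w_target = g[target]
        total += w_target
        for word, w in g.items():
            sums[word] += w_target * w     # weighted by the one word's weight
    return {word: s / total for word, s in sums.items()}   # averaging step

groups = [
    {"Kyoto": 0.8, "festival": 0.6, "temple": 0.4},
    {"Kyoto": 0.6, "festival": 0.5, "July": 0.3},
    {"Kyoto": 0.1, "Tokyo": 0.9},   # below threshold: ignored
]
print(related_word_group("Kyoto", groups))
# "festival" ends up most relevant to "Kyoto" after Kyoto itself
```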
  • The relevance addition step includes: a step of multiplying, for each word of the weighted word group associated with each sentence unit or with the accepted words, the relevance degrees of the words in the related word group stored in association with that word by the weight value of that word in the weighted word group; and a step of re-assigning the weight values of the words of the weighted word group based on the multiplication results.
  • The related word group of each word is calculated as a multidimensional relevance vector having each word as one dimension and the relevance degree given to each word as the element in the corresponding dimension, and the relevance addition step converts the multidimensional vector stored for each sorted sentence unit by a matrix whose columns are the relevance vectors of the respective words.
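  • A minimal sketch of this conversion follows, using NumPy; the vocabulary, the relevance matrix, and the choice of 1.0 for each word's relevance to itself are invented for illustration.

```python
import numpy as np

vocab = ["Kyoto", "festival", "July"]

# Column j of R is the relevance vector of vocab[j]; each word's relevance to
# itself is taken as 1.0 here (an assumption).
R = np.array([
    [1.0, 0.7, 0.4],
    [0.7, 1.0, 0.5],
    [0.4, 0.5, 1.0],
])

# Weighted word group of one utterance: "festival" itself never appeared.
v = np.array([0.9, 0.0, 0.2])

v_converted = R @ v
print(dict(zip(vocab, v_converted.round(2).tolist())))
# {'Kyoto': 0.98, 'festival': 0.73, 'July': 0.56}
# "festival" now carries weight via its relevance to "Kyoto", so sentence
# units about festivals move closer to this utterance's vector.
```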
  • In another aspect, the sentence unit search method uses a document set in which a plurality of document data consisting of natural language are stored, accepts words, and searches the document set based on the accepted words. As preprocessing, it includes the steps of: separating the document data obtained from the document set into sentence units each consisting of one or more sentences; extracting, for each sentence unit, the words that appear in it or that are referred to from preceding sentence units in the document data; specifying and storing, for each extracted word, its features in each sentence unit; specifying, for each word extracted for a sentence unit, a feature pattern including the pattern of the combination of its features when it appeared in the sentence unit and preceding sentence units, or the pattern of its being referred to from preceding sentence units; storing each specified feature pattern together with whether the word specified by it appeared or was referred to in a subsequent sentence unit; executing regression learning over all sentence units of the documents obtained from the document set, by regression analysis of the reference probability that a word specified by a given feature pattern appears or is referred to in the subsequent sentence unit, so as to obtain the regression coefficients corresponding to the feature patterns; calculating, for each sentence unit and for each word extracted up to that sentence unit in the document data, the reference probability of the word using the regression coefficients corresponding to the feature pattern identified in that sentence unit; and storing in advance the weighted word groups to which the calculated reference probabilities are assigned.
  • At search time, it includes the steps of: when words are accepted, storing them in the order of acceptance; extracting the words that appear in the accepted words or that are referred to from previously accepted words; identifying the features of those words in the accepted words; specifying a feature pattern including the pattern of the combination of features when a word appeared in previously accepted words, or the pattern of its being referred to from them; calculating the reference probability of each word using the regression coefficients corresponding to the specified feature pattern; and associating with the accepted words a weighted word group to which the calculated reference probabilities are assigned.
  • Finally, it includes the steps of: calculating, for each word common to the weighted word group of the accepted words and the weighted word group of each previously sorted sentence unit, the difference between the assigned reference probabilities; assigning priorities to the previously sorted sentence units in ascending order of that difference; and outputting the sentence units based on the assigned priorities.
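  • As an illustration of this difference-based ranking, here is a minimal sketch; the unit names, words, and probabilities are invented, and treating units with no shared words as lowest priority is an assumption the claim leaves open.

```python
def probability_difference(query: dict[str, float],
                           stored: dict[str, float]) -> float:
    """Sum of absolute differences of reference probabilities over shared words;
    smaller means more similar contextual meaning."""
    shared = set(query) & set(stored)
    if not shared:
        return float("inf")     # no shared words: lowest priority (assumption)
    return sum(abs(query[w] - stored[w]) for w in shared)

units = [
    ("S_a", {"Kyoto": 0.7, "festival": 0.8}),
    ("S_b", {"Kyoto": 0.2, "festival": 0.1}),
]
query = {"Kyoto": 0.8, "festival": 0.7}
ranked = sorted(units, key=lambda u: probability_difference(query, u[1]))
print([name for name, _ in ranked])   # ['S_a', 'S_b']
```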
  • The sentence unit search device according to the present invention comprises: means for acquiring document data from a document set in which a plurality of document data consisting of natural language are stored; means for receiving words; means for separating the acquired document data into sentence units each consisting of one or more sentences; means for storing, in association with each separated sentence unit, a weighted word group composed of a plurality of words each assigned a weight value for that sentence unit; means for storing received words in the order received; means for associating with newly received words a weighted word group composed of a plurality of words each assigned a weight value in those words; means for extracting, from the previously sorted sentence units, sentence units recorded in association with weighted word groups similar to the weighted word group associated with the received words; and means for outputting the extracted sentence units.
  • The computer program according to the present invention causes a computer, which is capable of acquiring document data from a document set in which a plurality of document data composed of natural language are stored and which comprises means for receiving words, to function as a device that searches the document set based on received words. It causes the computer to function as: means for separating the acquired document data into sentence units of one or more sentences; means for storing, in association with each sentence unit, a weighted word group composed of a plurality of words each assigned a weight value for that sentence unit; means for storing received words in the order received; means for associating with newly received words a weighted word group composed of a plurality of words each assigned a weight value in those words; and means for extracting sentence units recorded in association with weighted word groups similar to the weighted word group associated with the received words.
  • The computer-readable recording medium according to the present invention is characterized in that the computer program described above is recorded thereon.
  • The document storage device according to the present invention comprises: means for storing a plurality of document data composed of natural language; means for dividing the stored document data, in order from the top, into sentence units composed of one or more sentences, extracting for each sentence unit the words that appear in it or are referred to from preceding sentence units, and storing the extracted words for each sentence unit;
  • means for storing, in association with each sentence unit in the document data, a weighted word group composed of a plurality of words each assigned a weight value for that sentence unit;
  • extraction means for extracting, from the weighted word groups associated with the sentence units, those that include one given word;
  • creation means for creating a related word group in which a degree of relevance of the one word to each word is assigned, and storage means for storing the created related word group in association with the one word. The processing of the extraction means, the creation means, and the storage means is executed for each extracted word, and each related word group is stored in association with the corresponding word.
  • In the present invention, document data is acquired from a document set in which document data composed of natural language is recorded, and the acquired document data is divided into sentence units of one or more sentences.
  • For each sentence unit, each word appearing in the document set is given a weight value in that sentence unit, and the weighted word group of words so weighted is stored in association with the sentence unit.
  • A weighted word group, in which each word is assigned a weight value with respect to the accepted words, is likewise associated with the accepted words.
  • A sentence unit associated with a weighted word group similar to the weighted word group of the accepted words is then extracted from the previously sorted sentence units and output.
  • When extracting the sentence units associated with similar weighted word groups, whether the distribution of weight values of the words in each weighted word group stored in advance in association with a sentence unit is similar to the distribution of weight values of the words in the weighted word group associated with the received words is determined by whether a predetermined condition is satisfied, and the sentence units associated with the weighted word groups determined to be similar are extracted.
  • The weighted word group is obtained as a multidimensional vector having each word as one dimension and the weight value given to each word as the element of the corresponding dimension. Whether weighted word groups are similar is determined by whether the distance between them, that is, the distance between the multidimensional vectors, is short. The extracted sentence units are output in order of increasing distance between the multidimensional vectors, that is, in order of similarity of the weighted word groups.
  • The reference probability is calculated as the rate at which words having the same feature pattern as the one specified for each word, including the pattern of appearances up to each sentence unit or of references from preceding sentence units, appear or are referred to in subsequent sentence units in the document set.
  • For each word extracted from the document set, regression analysis is performed on the specified feature patterns and on the determination of whether the word appeared or was referred to in subsequent sentence units of the documents, and regression coefficients of the feature patterns with respect to the reference probability are calculated.
  • The reference probability of each word is then calculated from its specified feature pattern and the regression coefficients.
  • The document set is divided into a first document set made up of written language and a second document set made up of spoken language.
  • The reference probability given to each word in the weighted word group associated with a sentence unit is calculated based on the first document set, and the reference probability given to each word in the weighted word group associated with the accepted words is calculated based on the second document set.
  • For specifying the feature pattern of each word, information such as the following is handled quantitatively: the number of sentence units or words from the preceding sentence unit or words in which the word appeared or was referred to, up to the current sentence unit or words; the dependency information of the word; the number of its appearances or references; the noun-type distinction of the word; whether the word was the subject; the grammatical person of the word; and the part-of-speech information of the word.
  • When the reference probability is calculated for speech, the features for specifying the feature pattern of each word are likewise handled quantitatively: the time elapsed since the preceding sentence unit or words in which the word appeared or was referred to, the speech rate corresponding to the word when it appeared or was referred to, and the pitch (voice frequency) information.
  • For one word extracted from the document set, the weighted word groups in which that word's weight value is not less than a predetermined value are extracted.
  • One related word group is created by integrating, word by word, the weight values of the plurality of weighted word groups extracted for the one word. The relevance degree of each word in the created related word group represents how deeply that word relates to the one word when the one word is given a weight value greater than the predetermined value.
  • A related word group is generated and stored in this way for each word extracted from the document set. The weight value of each word of the weighted word group associated with each sentence unit or with the accepted words is then reassigned using the relevance degrees of the related word group stored for each word.
  • For the word groups extracted as weighted word groups in which the one word's weight value is greater than or equal to the predetermined value, the sum of each word's weight values, weighted by the weight value of the one word in each group, is calculated. The sums are averaged, and the averaged sum of weight values of each word is assigned as the relevance degree of that word in the related word group.
  • The relevance degrees of the related word groups stored for the respective words are multiplied by the weight values of the words of the weighted word group associated with each sentence unit or with the accepted words, and the multiplication results are reassigned as the weight values of the words of the weighted word group. Focusing on one word of the weighted word group, the relevance degrees of the related word group corresponding to that one word are used: the weight value of each word other than the one word is multiplied by its relevance degree in the related word group associated with the one word, so that the influence exerted on the one word's weight value by the weight values of highly associated words is taken into account.
  • The related word group of each word is obtained as a multidimensional relevance vector having each word as one dimension
  • and the relevance degree given to each word as the element of the corresponding dimension.
  • The multidimensional vector associated with each sentence unit or with the accepted words is converted by a matrix whose columns are the relevance vectors of the respective words.
  • The converted multidimensional vector is thus represented in an oblique coordinate system in which the axes of words with high mutual relevance lie closer together, shortening the distance between such words.
  • A multidimensional vector representing a weighted word group is rotated toward the axes of words highly related to the words it contains, and the distance between multidimensional vectors containing highly related words becomes shorter.
  • In the present invention, the words that appear in each sentence unit or that are referred to from preceding sentence units are extracted.
  • For each word, a feature pattern is identified that includes the pattern of the combination of its features leading up to each sentence unit, or the pattern of its references from preceding sentence units.
  • The reference probabilities of the extracted words are calculated and stored in advance as a weighted word group for each sentence unit.
  • For the accepted words, a feature pattern based on the preceding words is likewise specified, the reference probability of each word is calculated, and a weighted word group is associated with them.
  • The pre-stored sentence units are output with priorities assigned in ascending order of the difference in reference probabilities for the words they share with the weighted word group of the accepted words.
  • In the present invention, a weighted word group, in which each of a plurality of words is assigned a weight value for the sentence unit, is stored in association with
  • each sentence unit of one or more sentences in the acquired document data.
  • The weighted word group is the set of weight values of the words in each sentence unit and can be regarded as information indicating the group of meanings of that sentence unit.
  • The weighted word group of each separated sentence unit is a group of meanings within the whole document.
  • It can be understood as a group of meanings that changes dynamically, in time series, with the flow of context following the preceding sentences of the document.
  • Whether weighted word groups are similar is determined from the distributions of the weight values of the words in the weighted word group of the accepted words and in each weighted word group stored in advance.
  • When the distributions resemble each other, the stored weighted word group and the weighted word group of the received words can be said to be similar.
  • The predetermined condition for judging that weighted word groups are similar is therefore a condition that the distributions of the weight values of the words are similar.
  • For example, when the ratio of one word's weight value to another word's weight value in one weighted word group approximates the corresponding ratio in the other weighted word group, the weighted word groups can be determined to be similar to each other.
  • The predetermined condition can also be set as whether the weight value of each word is equal to or greater than a predetermined value.
  • It is also possible to determine similarity by whether the difference between the weight values of the same word is small.
  • By calculating the weighted word group as a multidimensional vector having each word as one dimension and the weight value of that word in the sentence unit or words as the element of the corresponding dimension component,
  • the group of meanings of each sentence unit or of the words can be treated as a quantitative vector.
  • Treating the group of meanings of each sentence unit or of the words as a quantitative multidimensional vector, a computer capable of vector calculation can directly extract similar sentence units by calculating the distance between the vector associated with the accepted words and the vector stored for each sentence unit.
  • Alternatively, a condition may be set on which region of the multidimensional space the multidimensional vector of the accepted words or of the previously sorted sentence units falls into, and similar sentence units can be extracted directly in that way.
  • The document set is not limited to a set of document data consisting of so-called written language. The separated sentence units are therefore not necessarily sentence units of written language.
  • Document data means data that has already been stored and is distinguished from words that are received in real time, and may be document data in which spoken dialogues are written in order.
  • The accepted words are not limited to keywords, sentences, and the like that are input for the purpose of search; they may be, for example, individual utterances, including speech, during a dialogue between users.
  • Since sentence units are extracted based on weighted word groups assigned weight values for each utterance, the group of meanings can be estimated for each utterance, taking into account that meaning changes dynamically and chronologically from utterance to utterance during a conversation.
  • Sentence units similar to the estimated group of meanings can therefore be extracted and presented for each utterance.
  • The weight value of each word of the weighted word group is given as the reference probability that the word will appear or be referred to in subsequent sentence units or words.
  • The reference probability can thus be expressed as the degree to which each word in the sentence unit is attended to, that is, its manifestation.
  • For sentence units, the reference probability is learned and calculated based on a document set consisting of written language; when the received words are spoken, the reference probability is learned and calculated based on a document set consisting of spoken language. As a result, sentence units with more similar meanings can be output, reflecting the differing characteristics of written and spoken language.
  • In the present invention, the degree of association of each word with the other words is quantitatively calculated and stored for each word.
  • The weight value of each word in the weighted word group is recalculated based on the weight values of the other words and on the degrees of association of those words with it.
  • The weight value of one word can thereby reflect the influence of the weight values of words highly associated with it: when the weight value of a word highly associated with the one word is high, the effect that the one word's weight value also becomes high can be reproduced.
  • When the related word group of one word is expressed as a relevance vector and the weighted word group is expressed as a multidimensional vector,
  • the multidimensional vector is converted with a matrix whose columns are the relevance vectors of the respective words. This shortens the distance between the multidimensional vectors representing weighted word groups that contain highly associated words.
  • In this way, the influence of the weight values of words highly relevant to one word can be reflected in that word's weight value. By reflecting the degrees of relevance in the manifestation of each word in each sentence unit or utterance, the invention has the excellent effect that sentence units which the user is reminded of can be searched effectively even if the corresponding words do not appear in the accepted words.
  • FIG. 1 is an explanatory diagram showing an outline of a sentence unit search method according to the present invention.
  • FIG. 2 is a block diagram showing a configuration of a search system using the sentence unit search device according to the first embodiment.
  • FIG. 3 is a flowchart showing the processing procedure by which the CPU of the sentence unit search device according to the first embodiment performs morphological analysis and syntactic analysis on the acquired document data, performs tagging and word extraction from the results, and stores them.
  • FIG. 4 is an explanatory diagram showing an example of the contents of document data stored in the document storage means in the first embodiment.
  • FIG. 5 is an explanatory diagram showing an example of document data that the CPU of the sentence unit search device according to the first embodiment gives the result of morphological analysis and syntactic analysis and stores in the document storage means.
  • FIG. 6 is an explanatory diagram showing an example of a list of extracted words for all document data acquired by the CPU of the sentence unit search device according to the first embodiment.
  • FIG. 7 is a flowchart showing the processing procedure by which the CPU of the sentence unit search apparatus according to Embodiment 1 extracts samples from the tagged document data stored in the document storage means and performs regression analysis to estimate the regression equation for calculating the reference probability.
  • FIG. 8 is an explanatory diagram showing an example of a feature pattern identified by a sentence in document data stored in the document storage unit in the first embodiment.
  • FIG. 9 is a flowchart showing the processing procedure by which the CPU of the sentence unit search apparatus according to the first embodiment calculates and stores word reference probabilities for each sentence of the tagged document data stored in the document storage means.
  • FIG. 10 is a flowchart showing the processing procedure by which the CPU of the sentence unit search device in Embodiment 1 calculates and stores word reference probabilities for each sentence of the tagged document data stored in the document storage means.
  • FIG. 11 is an explanatory diagram showing an example in which the CPU of the sentence unit search device in Embodiment 1 separates the document represented by the document data into sentences.
  • FIG. 14 is an explanatory diagram showing how the set of words stored for each sentence by the CPU of the sentence unit search apparatus, and the reference probabilities calculated for those words, change as the sentences continue.
  • FIG. 15 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the first embodiment.
  • FIG. 16 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the first embodiment.
  • FIG. 17 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the first embodiment.
  • FIG. 18 is an explanatory diagram showing an example of a feature pattern specified for text data that the CPU of the sentence unit search device according to the first embodiment received from the receiving device.
  • FIG. 19 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the second embodiment.
  • FIG. 20 is an explanatory diagram showing an outline of the influence of the manifestation of a word closely related to one word, related to the search method of the present invention in Embodiment 3.
  • FIG. 26 is an explanatory diagram showing an example of the content of a weight value representing the manifestation of each word calculated by the CPU of the sentence unit search device in the third embodiment.
  • FIG. 27 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the third embodiment.
  • FIG. 28 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the third embodiment.
  • FIG. 29 is a block diagram showing a configuration when the sentence unit retrieval method of the present invention is implemented by a sentence unit retrieval apparatus.
  • FIG. 1 is an explanatory diagram showing an outline of the sentence unit search method according to the present invention.
  • Reference numeral 100 in FIG. 1 denotes a document set in which a plurality of document data are stored. One document 101 obtained from the document set 100 is separated into sentence units S1, ..., Sk, ...
  • 200 in FIG. 1 represents a conversation between user A and user B.
  • The conversation 200 between user A and user B is a time-series set of utterances U1, ..., Uj made by users A and B, and the conversation proceeds in the order of those utterances.
  • The sentence unit search method according to the present invention treats the degree of attention the user pays to each word, at the time of writing or uttering the sentence unit or words, as a quantitative weight value assigned to each word,
  • uses the weighted word group, which reflects the degree of attention changing from one time-series sentence unit or utterance to the next, as an index representing the contextual meaning of each sentence unit,
  • and aims to directly search for and output sentence units having a similar contextual meaning.
  • Conversation 200 in the example shown in the explanatory diagram of FIG. 1 is a conversation about travel to Kyoto between user A and user B.
  • In utterance U1 of conversation 200, "Kyoto" and "travel" appear.
  • As the conversation continues, "Kyoto" and "time" attract more attention than "travel", and user A and user B should both be able to recognize that the contextual implications are shifting. Further on, "famous" and "festival" appear in utterance Uj. Considering utterance Uj in isolation, the words "Kyoto", "travel", "time", and "hot" do not appear in it. However, at least for user A, utterance Uj carries the meaning of a "festival" in "Kyoto" in the "summer" context. Therefore, even at the time of utterance Uj, "Kyoto" still carries weight in the contextual implications. It should also be noted that user A, who makes utterance Uj, should at least be reminded of the "Gion Festival" as the festival in question.
  • Sentence unit Sk in its context has the meaning that, for "Kyoto" in "July", it is the "Gion Festival".
  • That is, the sentence unit Sk has the meaning that in "summer", in "July", in "Kyoto", it is the "Gion Festival".
  • Utterance Uj and sentence unit Sk thus share weights on "summer", "Kyoto", and "festival", and have similar contextual implications.
  • The aim is to estimate, from the preceding utterances, the group of contextual meanings that the user is aware of at the time of utterance Uj, and thereby to directly search for and output the sentence unit Sk having a similar contextual meaning.
  • In this way, the computer system can present relevant information for each utterance and join the conversation.
  • the computer system can support the conversation between user A and user B.
  • For example, when the computer system outputs an audio message such as "In July, Kyoto has the Gion Festival" after utterance Uj by user A in conversation 200, a three-way exchange among user A, user B, and the computer system is realized.
  • Alternatively, by having the computer system present information such as "the Gion Festival, for Kyoto in July", support of the conversation between user A and user B is also realized.
  • the computer system is made to execute the sentence unit search method according to the present invention.
  • The computer device requires preprocessing, including processing to store the document data of the document set in advance, separated into sentence units, and to prepare quantitative information representing the contextual meaning of each of the separated sentence units.
  • It also requires search processing, including processing to obtain quantitative information representing the meaning of each utterance in the flow of the conversation, and processing to extract sentence units with similar meanings based on the information obtained for the utterance
  • and to output them as search results.
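  • The two phases can be sketched end to end as follows. The recency-decayed word weighting and the cosine measure are stand-ins for the patent's regression-based reference probabilities and its distance or difference measures; all documents and utterances are invented.

```python
import math
import re

def split_into_sentence_units(document: str) -> list[str]:
    # One sentence = one sentence unit, as in Embodiment 1 (split on "." or "。").
    return [s.strip() for s in re.split(r"[.。]", document) if s.strip()]

def weighted_word_group(units_so_far: list[str]) -> dict[str, float]:
    # Stand-in manifestation: recently/often mentioned words weigh more.
    weights: dict[str, float] = {}
    n = len(units_so_far)
    for i, unit in enumerate(units_so_far):
        for word in unit.lower().split():
            weights[word] = weights.get(word, 0.0) + 0.5 ** (n - 1 - i)
    return weights

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Preprocessing: index each sentence unit with its weighted word group.
documents = ["Kyoto holds the Gion Festival in July. The festival is famous.",
             "Tokyo is crowded in winter."]
index = []
for doc in documents:
    units = split_into_sentence_units(doc)
    for i, unit in enumerate(units):
        index.append((unit, weighted_word_group(units[: i + 1])))

# Search processing: weight the conversation so far, then pick the closest unit.
conversation = ["we will travel to kyoto", "it is hot in july"]
query = weighted_word_group(conversation)
best = max(index, key=lambda item: cosine(query, item[1]))
print(best[0])   # the Gion Festival unit matches the conversation context
```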
  • In Embodiments 1 to 3 described below, the hardware configuration necessary for causing a computer device to execute the sentence unit search method according to the present invention is described first, and the processing by the computer device is then explained step by step, distinguishing the preprocessing from the search processing.
  • Specifically, in each embodiment, as an example of executing the sentence unit search method according to the present invention, a search system is described that includes hardware storing a document set of document data, computer devices that accept utterances,
  • and a computer device that executes the search processing by connecting to the devices that accept utterances and to the hardware that stores the document set.
  • Each process and specific example is shown mainly for the case where the document set consists of Japanese natural sentences.
  • the sentence unit search method of the present invention can be applied not only to Japanese but also to other languages.
  • For grammatical handling specific to each language, such as language analysis (morphological analysis and syntactic analysis), the method most appropriate to each language is used.
  • FIG. 2 is a block diagram showing a configuration of a search system using the sentence unit search apparatus 1 according to the first embodiment.
  • The retrieval system consists of a sentence unit retrieval device 1 that executes the retrieval processing on document data, document storage means 2 that stores document data in natural language, a packet switching network 3 such as the Internet, and accepting devices 4, 4, ... that accept user input words such as keywords or speech.
  • The sentence unit search device 1 is a PC (personal computer), and the accepting devices 4, 4, ... are also PCs; the sentence unit retrieval device 1 is connected to the accepting devices 4, 4, ... via the packet switching network 3.
  • the sentence unit search apparatus 1 stores document data including a sentence unit to be searched in the document storage unit 2 in advance.
  • The sentence unit search device 1 classifies the document data stored in the document storage means 2 into sentence units in advance, and stores quantitative information representing contextual meaning for each sentence unit so that it can be searched later.
  • The receiving devices 4, 4, ... convert the received words into text data or voice data that can be processed by a computer, and transmit the data to the sentence unit searching device 1 via the packet switching network 3.
  • The sentence unit retrieval device 1 extracts sentence units, each consisting of one or more sentences, from the document data stored in the document storage means 2 based on the received word data, and outputs the extracted sentence units to the receiving devices 4, 4, ... via the packet switching network 3, thereby realizing a sentence-unit search.
  • The sentence unit search device 1 includes at least a CPU 11 that controls the various hardware components, an internal bus 12 connecting them, storage means 13 comprising nonvolatile memory, a temporary storage area 14 comprising volatile memory, communication means 15 for connection to the packet switching network 3, document set connection means 16 for connection to the document storage means 2, and auxiliary storage means 17 that uses a portable recording medium 18 such as a DVD or CD-ROM.
  • The storage means 13 stores a control program 1P, acquired from a portable recording medium 18 such as a DVD or CD-ROM, for the PC to operate as the sentence unit search device 1 according to the present invention.
  • The CPU 11 reads out the control program 1P from the storage means 13 and executes it, controlling the various hardware components via the internal bus 12.
  • the temporary storage area 14 stores information temporarily generated by the arithmetic processing of the CPU 11.
  • The CPU 11 detects, via the communication means 15, that word data transmitted from the accepting devices 4, 4, ... has been received, and executes search processing based on the received word data. The CPU 11 can also acquire the document data stored in the document storage means 2 via the document set connection means 16, and can store document data in the document storage means 2 via the document set connection means 16.
  • The control program 1P, obtained from the portable recording medium 18 such as a DVD or CD-ROM via the auxiliary storage means 17 and stored in the storage means 13, can execute natural language analysis such as morphological analysis and syntactic analysis on document data expressed as character strings, based on dictionary information also stored in the storage means 13.
  • The accepting devices 4, 4, ... include at least a CPU 41 that controls the various hardware components, an internal bus 42 connecting them, storage means 43 composed of nonvolatile memory, a temporary storage area 44 composed of volatile memory, operation means 45 such as a mouse or keyboard, display means 46 such as a monitor, voice input/output means 47 such as a microphone and speaker, and communication means 48 for connection to the packet switching network 3.
  • The storage means 43 stores a processing program for the PC to operate as an accepting device 4.
  • The CPU 41 reads the processing program from the storage means 43 and executes it, controlling the various hardware components via the internal bus 42.
  • The temporary storage area 44 stores information temporarily generated by the arithmetic processing of the CPU 41.
  • the CPU 41 can detect a character string input operation from the user via the operation means 45 and store the input character string in the temporary storage area 44.
  • The CPU 41 can detect voice input from the user via the voice input/output means 47 and, by reading and executing the voice recognition program stored in the storage means 43, convert it into text data. The CPU 41 can also capture the voice input by the user through the voice input/output means 47 as voice data that can be processed by a computer.
  • the CPU 41 transmits text or voice word data obtained by detecting a character string input operation or voice input from the user to the sentence unit search device 1 via the communication means 48.
  • The CPU 41 may convert voice data into text data before transmission; it may also transmit features of the voice data obtained by voice recognition, for example the utterance speed of the phonemes corresponding to each word and the frequency of those phonemes.
  • The CPU 41 may also store the time stamps of the speech data corresponding to each word, and send to the sentence unit search device 1 the time elapsed since the word was last included in previously accepted words.
  • As preprocessing, the sentence unit search apparatus 1 first prepares the document set so that a group of meanings can later be represented for each sentence unit contained in each document data. Under "2. Document data acquisition and language analysis", the process by which the sentence unit search device 1 stores document data in the document storage means 2, separates each document data into sentence units of one or more sentences, analyzes the grammatical characteristics of each sentence, and stores the results in the document storage means 2 for each sentence unit is described. In the first embodiment, the case where the sentence unit search device 1 treats one sentence as one sentence unit is described.
  • the CPU 11 of the sentence unit search device 1 stores document data including the sentence unit to be searched in the document storage unit 2 in advance.
  • the CPU 11 of the sentence unit search device 1 acquires the document data that can be acquired via the communication unit 15 and the packet switching network 3 by Web crawling, and stores it in the document storage unit 2 via the document set connection unit 16.
  • The CPU 11 of the sentence unit search device 1 separates the document data acquired and stored in the document storage means 2, via the document set connection means 16, into sentence units, performs language analysis (morphological analysis and syntactic analysis), and stores the results in association with each sentence unit.
  • FIG. 3 is a flowchart showing the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 according to the first embodiment performs tagging and word extraction from the results of the morphological analysis and syntactic analysis of the acquired document data and stores them.
  • The processing shown in the flowchart of FIG. 3 corresponds to the processing of extracting the words that appear in each sentence unit or that are referred to from preceding sentence units, together with the features of each word in each sentence unit, and storing them.
  • The CPU 11 determines whether or not it has acquired document data (step S11). If the CPU 11 determines that it has not acquired document data (S11: NO), it returns the process to step S11 and waits until document data is acquired. When the CPU 11 determines that document data has been acquired (S11: YES), it attempts to read each sentence from the acquired document data and determines whether the reading succeeded (step S12).
  • From the results of the morphological analysis and syntactic analysis, the CPU 11 extracts the words that appear in the analyzed sentence and the words in the sentence that refer back to preceding sentences, and stores them in a list (step S14). Further, as described later, the CPU 11 generates tags from the analysis results (step S15), adds the tags to the read sentence, and stores it in the document storage means 2 via the document set connection means 16 (step S16).
•	the above processing is performed every time document data is acquired, and the tagged document data is stored in the document storage means 2.
  • FIG. 4 is an explanatory diagram showing an example of the contents of document data stored in the document storage means 2 in the first embodiment.
•	the document data stored in the document storage means 2 is HTML (HyperText Markup Language) and other text data that the CPU 11 of the sentence unit search apparatus 1 obtains, via the communication means 15, from publicly accessible Web servers connected to the packet switching network 3.
•	the example shown in Fig. 4 is an excerpt of HTML document data obtained from a web page published on the Internet (http://ja.wikipedia.org/wiki/...). In the following, this document example will be used to explain document analysis and retrieval.
•	in the sentence reading process of step S12 shown in the flowchart of FIG. 3, the CPU 11 of the sentence unit search device 1 sorts the character strings in the acquired document data into linguistic units of one sentence (sentence units). For example, the CPU 11 may sort on the character string representing the Japanese full stop "。" when the document data is composed of Japanese, or on the character string representing the period "." when the document data is composed of English.
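•	A minimal sketch of this sorting step is shown below (illustrative only; the function name and the delimiter handling are assumptions, not part of the embodiment):

```python
import re

def split_into_sentence_units(text: str, lang: str = "ja") -> list[str]:
    """Sort a character string into sentence units on the full stop:
    "。" for Japanese document data, "." for English document data."""
    delimiter = "。" if lang == "ja" else "."
    # Split after each delimiter, keeping it attached to its sentence.
    parts = re.split(f"(?<={re.escape(delimiter)})", text)
    return [p.strip() for p in parts if p.strip()]

print(split_into_sentence_units("九州地方北部では祭りが行われる。秋に行われる。"))
print(split_into_sentence_units("A festival is a ritual. It is held in autumn.", lang="en"))
```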
•	the CPU 11 of the sentence unit search device 1 performs morphological analysis on the linguistic unit of one sentence based on dictionary information, identifies the morphemes that are the minimum constituent units of the sentence, and analyzes the morpheme structure. For example, in the document data shown in FIG. 4, morphemes are identified by collating the character string against dictionary entries such as nouns like "Festival" and "God Spirit", proper nouns like "Kyushu", verbs like "speak", particles like "to" and "ha", and symbols such as "," and ".".
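•	The following toy longest-match tokenizer sketches this collation against dictionary information (the dictionary entries are assumptions for illustration; real analyzers such as ChaSen use large dictionaries and statistical models):

```python
# Toy dictionary: surface form -> part of speech (assumed entries).
DICTIONARY = {
    "九州": "proper noun",
    "地方": "noun",
    "北部": "noun",
    "で": "particle",
    "は": "particle",
    "、": "symbol",
    "祭り": "noun",
}

def tokenize(sentence: str) -> list[tuple[str, str]]:
    """Identify morphemes by greedy longest match against the dictionary."""
    morphemes, i = [], 0
    while i < len(sentence):
        for j in range(len(sentence), i, -1):  # try the longest candidate first
            surface = sentence[i:j]
            if surface in DICTIONARY:
                morphemes.append((surface, DICTIONARY[surface]))
                i = j
                break
        else:  # no dictionary entry: emit a single character as unknown
            morphemes.append((sentence[i], "unknown"))
            i += 1
    return morphemes

print(tokenize("九州地方北部では、祭り"))
```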
  • Various techniques for morphological analysis have been proposed today, and the present invention does not limit the morphological analysis techniques.
•	next, using the part-of-speech information (noun, particle, adjective, verb, adverb, etc.) of each identified morpheme, the CPU 11 of the sentence unit search device 1 performs syntactic analysis to extract the grammatical relationships between morphemes, based on grammatical information that statistically captures the cohesiveness between parts of speech under Japanese grammar for a Japanese sentence or English grammar for an English sentence. For example, by applying the grammar as a tree structure, the relationships between morphemes can be extracted according to that tree structure.
•	for example, suppose the analysis target is (adjective + noun + particle + noun). First, it is determined whether the analysis target applies to (adjective + noun), that is, whether the first morpheme of the analysis target is an adjective. When the first morpheme is determined to be an adjective, it is determined that the adjective is the outermost modifier in the analysis target, modifying the noun that follows; in other words, the relationship (adjective + (noun)) is extracted. Next, the remaining analysis target is checked against (noun); if it consists of multiple morphemes and is not a single noun, it is determined whether the remaining analysis target applies to (adjective + noun), that is, whether its first morpheme is an adjective. If the first morpheme of the remaining analysis target is not an adjective, the adjective part of (adjective + noun) is expanded to (noun + particle), and it is determined whether the remaining analysis target applies to ((noun + particle) + noun). In this way, the grammatical relationship between the morphemes of the analysis target (adjective + noun + particle + noun) is extracted as [adjective + {(noun + particle) + noun}].
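•	The recursive pattern check described above can be sketched as follows (illustrative only; actual syntactic analyzers apply full grammars rather than this toy procedure):

```python
def parse(pos_seq: list[str]):
    """Bracket a part-of-speech sequence such as
    [adjective, noun, particle, noun] into nested dependency relations."""
    if len(pos_seq) == 1:
        return pos_seq[0]
    if pos_seq[0] == "adjective":
        # The adjective is the outermost modifier of everything that follows.
        return ("adjective", parse(pos_seq[1:]))
    if len(pos_seq) >= 3 and pos_seq[0] == "noun" and pos_seq[1] == "particle":
        # Expand the modifier slot to (noun + particle).
        return (("noun", "particle"), parse(pos_seq[2:]))
    return tuple(pos_seq)

# Yields the bracketing [adjective + {(noun + particle) + noun}]:
print(parse(["adjective", "noun", "particle", "noun"]))
```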
•	the method of syntactic analysis is not limited to such a method; as with morphological analysis, various methods have been proposed today, and the present invention does not limit the method of syntactic analysis.
•	the CPU 11 of the sentence unit search device 1 generates document data in which the analyzed morphemes and the grammatical relationships between the morphemes are represented by tags based on XML (eXtensible Markup Language), and stores it in the document storage means 2. That is, the input character string is morphologically analyzed and further syntactically analyzed, and is sorted into morphemes, each annotated with its part-of-speech information and other morpheme information.
  • the control program 1P stored in the storage means 13 of the sentence unit retrieval apparatus 1 is configured to allow the CPU 11 of the sentence unit retrieval apparatus 1 to execute the natural language analysis method.
•	for example, phrase number 0 is analyzed as (0: Kyushu (noun + proper noun + region + general, Kyushu, Kyuushuu) / region (noun + general, region, chihou) / north (noun + general, region, northern) / de (particle + case particle + general, de) / wa (particle + topic particle, wa) / , (symbol + punctuation)); in this way the morphemes are identified and information is added to them.
•	for example, the morpheme "Kyushu" is a noun, a proper noun, and a noun indicating a region, and is sometimes used as a general noun; its basic form is "Kyushu", and its pronunciation can be determined to be "Kyuushuu".
•	the dependency information is obtained, for example, as (0 2, 1 2, 2 -1), so that the dependency relationships between phrases can be discriminated: the phrase with phrase number 0 depends on the phrase with phrase number 2, the phrase with phrase number 1 depends on the phrase with phrase number 2, and the phrase with phrase number 2 can be identified as having no dependency destination because its destination is -1.
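•	As a sketch, this dependency information can be held in a simple mapping (the names are illustrative, not part of the embodiment):

```python
# (0 2, 1 2, 2 -1): phrase number -> dependency destination, -1 = head.
dependencies = {0: 2, 1: 2, 2: -1}

def head_phrase(deps: dict[int, int]) -> int:
    """Return the phrase whose dependency destination is -1 (the head)."""
    return next(n for n, dest in deps.items() if dest == -1)

for phrase, dest in dependencies.items():
    if dest == -1:
        print(f"phrase {phrase}: head (no dependency destination)")
    else:
        print(f"phrase {phrase} depends on phrase {dest}")
```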
•	FIG. 5 is an explanatory diagram showing an example of document data to which the CPU 11 of the sentence unit search apparatus 1 according to Embodiment 1 has added the results of morphological analysis and syntactic analysis and which it has stored in the document storage means 2. This corresponds to an example of the document data stored in the document storage means 2 by executing the processing procedure shown in the flowchart of FIG. 3 on the document data having the contents shown in FIG. 4.
•	in FIG. 5, the CPU 11 of the sentence unit search apparatus 1 has sorted a part of the document shown in FIG. 4 into morphemes such as proper nouns, nouns, particles, and verbs, and the relationships between them are expressed by nesting tags.
•	the example shown in FIG. 5 is based on the tagging method according to the rules proposed by GDA (Global Document Annotation; see http://i_content.org/gda). However, the present invention is not limited to complying with these rules; as long as the computer can identify the morpheme information and the dependency relationships between morphemes by information processing, the method is not limited to XML tagging.
•	the tag indicated by <su> is a tag representing a sentence unit (Sentential Unit).
•	for example, it can be identified by the tags that the sentence "In the northern part of the Kyushu region, what happens in the fall is sometimes called (O)kunchi" has three clause units: "in the northern part of the Kyushu region", "is called (O)kunchi for what happens in the fall", and "there is".
•	the tag indicated by <ad> is a tag that indicates a particle other than a final particle, an adverb, an adjunct, and the like. The tag indicated by <n> indicates a noun, and the tag indicated by <v> indicates a verb.
  • the attribute represented by the attribute name syn indicates a dependency relationship between language units such as clauses or words sandwiched between tags to which the attribute is assigned.
•	when the attribute value f (forward) is assigned, it means that the linguistic unit constituting the sentence depends on the nearest subsequent linguistic unit. Therefore, in principle, clause 0 "in the northern part of the Kyushu region" relates to clause 1 "is called (O)kunchi for what happens in the fall", and clause 1 in turn relates to "there is" in clause 2.
•	by using <np> instead of the tag indicated by <n>, it can be shown that the unit is not a word on the side that receives the dependency.
•	"Northern part of the Kyushu region" can be sorted into the morphemes "Kyushu", "region", and "north", each sandwiched between <n> tags, because "Kyushu" relates to "region" and "region" relates to "north".
•	likewise, for "events (happenings), festivals", the tags can show the dependency relationship in which "events (happenings)" relates to the particle "no" rather than directly to "festival".
•	a proper noun representing a place, such as "Kyushu", or a proper noun representing a person's name, such as "Taro", can be indicated by a <placename> or <pername> tag, respectively.
  • a morpheme referenced from a preceding word or sentence such as a demonstrative pronoun or a zero pronoun can be expressed using an attribute indicating an anaphoric relationship.
•	the attribute name id can be used to indicate which preceding word or sentence a demonstrative pronoun or zero pronoun refers to. For example, for the sentences "There is a button on the right side. Please press it.", a human reader can naturally supply that "it" refers to the "button". However, when they are processed by a computer, although "it" can be identified as a demonstrative pronoun by checking against dictionary information, what "it" refers to cannot be determined.
  • the corresponding relationship can be indicated by the id attribute, the eq attribute, and the obj attribute described above.
•	for example, by tagging the first sentence as "There is a <np id="btn">button</np> on the right side.", marking "it" in the second sentence with <np eq="btn">, and marking the verb "press" in the third sentence with obj="btn", it can be indicated that "it" in the second sentence refers to the "button" and that the object of "press" in the third sentence is the "button".
•	information indicating the result of the morphological analysis is added, under the attribute name mph, to the attribute information of the tags such as <n>, <ad>, and <v> that sandwich each morpheme. The attribute value indicates the part-of-speech information, basic-form information, pronunciation information, etc. of the morpheme obtained by the morphological analysis.
•	additional information, part-of-speech information, inflected-form information, basic-form information, and pronunciation information are given as attribute values in the form mph="additional information; part-of-speech information; inflected-form information; basic-form information; pronunciation information".
•	for example, "Kyushu" can be classified by the part-of-speech information noun + proper noun + region + general; its basic form is Kyushu, and its pronunciation "Kyuushuu" is clearly indicated by the mph attribute.
•	identification information from the morphological analyzer, such as a ChaSen identifier, is added as additional information of the morpheme.
•	in this way, the CPU 11 of the sentence unit search apparatus 1 tags the document data obtained by Web crawling with the results of the morphological analysis and syntactic analysis according to the GDA rules, and stores the resulting XML data in the document storage means 2 via the document set connection means 16.
•	thereafter, the CPU 11 of the sentence unit search apparatus 1 can identify the tags of the document data by character string analysis and, by identifying the attribute information attached to each tag, can identify the morpheme information and grammatical relationships.
  • FIG. 6 is an explanatory diagram illustrating an example of a list of extracted words for all document data acquired by the CPU 11 of the sentence unit search device 1 according to the first embodiment.
•	in this example, 31245 words are listed. It should be noted that overly common words such as "thing" are excluded from the stored words. This is because such a word, like a conjunction or an article, is too general: although it appears frequently, the word itself carries little meaning, so it burdens the search processing and is inappropriate as a search target.
  • the CPU 11 of the sentence unit search device 1 specifies information that quantitatively represents a group of meanings of the sentence for each sentence in the document data stored in the document storage unit 2.
•	information that quantitatively expresses a group of meanings of a sentence is expressed as the group of words that the user is paying attention to when using the sentence (speaking, writing, listening, or reading), together with a value (word weight value) that quantitatively indicates the degree of attention, that is, the salience, of each word.
•	the salience of each word in a sentence could also be quantified by the appearance frequency used by conventional search services. However, the appearance frequency is obtained over a document or the entire document set. Therefore, while calculating the appearance frequency of each word for each document can quantitatively represent the meaning of the document as a whole, it cannot represent a group of meanings that reflects a context changing dynamically with the flow within the document.
•	the salience of a word in a sentence can be expressed grammatically by the degree of attention the word received in the preceding sentence and by the transition of that degree of attention in the current sentence, depending on how the word is used. In other words, if a word that was the topic (subject) in the preceding sentence is also the topic (subject) in the current sentence, that word is the most noticeable word in the current sentence and has high salience. On the other hand, a word that did not appear in the preceding sentence but is the topic (subject) of the current sentence is attracting attention in the current sentence, but compared with a word that continues to be used as the topic as described above, its salience can be said to be low.
•	such transitions of salience have been studied as Centering Theory (Grosz et al., 1995; Nariyama, 2002; Poesio et al., 2004).
•	in Centering Theory, however, the salience of each word is not represented as a feature value that can be calculated quantitatively by a computer or the like; it is only possible to determine to which of the transitions defined by the theory the transition of each word belongs. Therefore, the present invention calculates the salience of each word in each sentence quantitatively.
•	specifically, a reference probability is calculated for each word in each sentence, and the calculated reference probability is assigned as a weight value representing the salience of each word in each sentence.
•	the reference probability that a word appears or is referenced in a subsequent sentence is calculated not from the meaning of the word, which is difficult to handle quantitatively, but from a feature pattern of how the word appears or is referenced, which the sentence unit search device 1 can analyze by information processing: the feature pattern of the word is identified, and the proportion of words having the same feature pattern that actually appear or are referenced in a subsequent sentence is calculated as the reference probability.
•	the reference probability of each word is defined as that word's weight value, and the set of words in a given sentence to which these weight values have been assigned is called a weighted word group. A group of meanings for each sentence unit can thus be expressed by a weighted word group to which quantitative weight values called reference probabilities are given.
•	if a sufficient number of occurrences of the same feature pattern as the specified feature pattern is available, the reference probability can be calculated statistically without difficulty, as the ratio of occurrences of the same feature pattern in which the word actually appears or is referenced in the subsequent sentence. In practice, however, the number of identical feature patterns is limited, and an enormous amount of document data would be required to calculate reliable reference probabilities. Therefore, a regression equation that predicts, from the feature pattern of a word, whether or not the word appears or is referenced in the subsequent sentence is obtained by learning a regression model on the feature patterns and the events of actually appearing or being referenced in the subsequent sentence.
•	sentences in the document data stored in the document storage means 2 are sandwiched between the tags indicated by <su>, and the words that appear in a sentence, or that have an anaphoric relationship with a pronoun or zero pronoun in the sentence, can be specified from the tag attribute information. Therefore, in the sentence unit search device 1 of the present invention, the feature pattern is specified as follows for the document data stored in the document storage means 2.
  • a sample (s, w) is a pair of one sentence s in the document data and a word w included in a sentence preceding the one sentence in the document data.
  • the feature pattern f (s, w) for the sample is specified by the following feature amount.
•	examples are the feature quantity of the distance (dist) from the sentence in which the word w most recently appeared or was referenced to the sentence s, the feature quantity of the grammatical role (gram) of the word w when it most recently appeared or was referenced, and the feature quantity of the number (chain) of sentences preceding the sentence s in which the word w appears or is referenced.
•	the feature quantities are not limited to these, and may include, for example, whether or not the word w is a word indicating a recent topic, or whether or not the word w denotes a person.
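•	The following sketch computes dist, gram, and chain from a simplified history (the sentence representation and role codes are assumptions for illustration; the embodiment derives them from the GDA tags):

```python
# Each sentence is a list of (word, grammatical_role) pairs; the role codes
# are arbitrary here (e.g. 2 = topic/subject, 1 = other).
Sentence = list[tuple[str, int]]

def feature_pattern(history: list[Sentence], word: str) -> dict[str, int]:
    dist = gram = chain = 0
    for back, sentence in enumerate(reversed(history), start=1):
        roles = [role for w, role in sentence if w == word]
        if roles and dist == 0:
            dist = back        # distance to the most recent appearance
            gram = max(roles)  # role code at that most recent appearance
        if roles:
            chain += 1         # number of preceding sentences containing w
    return {"dist": dist, "gram": gram, "chain": chain}

history = [
    [("Taro", 2), ("school", 1)],  # Taro appears as the topic
    [("Taro", 1), ("book", 1)],
]
print(feature_pattern(history, "Taro"))  # {'dist': 1, 'gram': 1, 'chain': 2}
```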
•	since the results of the morphological analysis and syntactic analysis are described by tags conforming to GDA, character string analysis of the document data makes it possible to delimit and count sentences by the <su> tag, to identify particles from the part-of-speech information indicated by the tags within each sentence, and to count the occurrences of words, including those referred to by demonstrative pronouns or zero pronouns. Therefore, the CPU 11 of the sentence unit search device 1 can specify the feature quantities dist, gram, and chain for each sample by analyzing the tags and their attribute values according to GDA.
  • the CPU 11 of the sentence unit search device 1 extracts a sample from the tagged document data stored in the document storage means 2, and obtains a feature amount from the extracted sample to identify a feature pattern.
•	the processing procedure for estimating, by regression analysis, the regression equation used to calculate the reference probability from the feature patterns of the extracted samples is also described.
•	FIG. 7 is a flowchart showing the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 in Embodiment 1 extracts samples from the tagged document data stored in the document storage means 2 and performs a regression analysis to estimate the regression equation for calculating reference probabilities.
•	the process shown in the flowchart of FIG. 7 corresponds to the process of identifying a feature pattern for each word in each sentence unit, determining whether or not the identified word appears or is referenced in the subsequent sentence unit, and performing regression learning based on these results so that the reference probability can be calculated.
  • the CPU 11 of the sentence unit search device 1 acquires tagged document data from the document storage means 2 via the document set connection means 16 (step S21).
•	the CPU 11 identifies the <su> tags added to the acquired document data by character string analysis and sorts it into sentences (step S22).
•	the CPU 11 identifies each tag within the <su> tags indicating a sentence by character string analysis, and extracts samples by associating the words that appear or are referenced in the sentence with the sentence (step S23).
•	for each extracted sample, the tags are identified by character string analysis, and the feature pattern consisting of dist, gram, and chain is specified (step S24).
•	the CPU 11 determines whether or not the sorted sentence is the last in the acquired document data (step S25); if the CPU 11 determines that it is not the last (S25: NO), the CPU 11 returns the process to step S22 and continues sorting by identifying the <su> tag of the subsequent sentence. Whether the sorted sentence is the last in the acquired document data can be determined, for example, by whether another <su> tag follows the <su></su> pair enclosing the currently sorted sentence; if it is determined that none follows, the sentence can be determined to be the last.
•	next, the CPU 11 determines whether or not the extraction of a predetermined number of samples is completed (step S26). If the CPU 11 determines that sample extraction is not complete (S26: NO), the CPU 11 returns the process to step S21, acquires different tagged document data, and continues the sample extraction. If the CPU 11 determines that the sample extraction is completed (S26: YES), the CPU 11 performs a regression analysis on the extracted samples, estimates the regression coefficient for each of the feature quantities dist, gram, and chain (step S27), and ends the process.
  • FIG. 8 is an explanatory diagram showing an example of a feature pattern identified by a sentence in the document data stored in the document storage unit 2 according to the first embodiment.
•	the feature pattern f(si, Taro-kun) of the sample (si, Taro-kun), formed from the sentence si shown in FIG. 8 and the word "Taro-kun" appearing in the preceding sentences, is identified as follows.
•	for example, the feature quantity of the distance (dist) is determined between the current sentence si and the sentence sj in which the word "Taro-kun" most recently appeared or was referenced among the preceding sentences; here si comes immediately after sj.
•	in step S27 shown in the flowchart of FIG. 7, the regression analysis is performed based on a logistic regression model. However, the regression analysis is not limited to this, and other regression analysis methods, such as a kNN (k-Nearest Neighbors) smoothing + SVR (Support Vector Regression) model, may be used.
•	for example, when the kNN smoothing + SVR model is used, the regression model can be learned using the following eight elements as the feature quantities of the feature pattern: in addition to dist, gram, and chain described above, the following elements can be handled as feature quantities.
•	One may be the type of noun (exp, pronoun: 1 / non-pronoun: 0) when the word w is referenced within the immediately preceding sentence unit.
•	Another may be whether or not the word w is the topic when it appears or is referenced in the immediately preceding sentence unit (last-topic, yes: 1 / no: 0).
•	Another may be whether or not the word w is the subject when it appears or is referenced in the immediately preceding sentence unit (last-sbj, yes: 1 / no: 0).
•	Another may be whether or not the word w denotes a person in the sample (s, w) (pi, yes: 1 / no: 0).
•	Another may be the part-of-speech information (pos, noun: 1, verb: 2, etc.) of the word w in the immediately preceding sentence unit where the word w appears or is referenced.
•	Another may be whether or not the word w is referenced in the title or a heading in the document (in_header, yes: 1 / no: 0).
•	furthermore, when speech is handled, the elapsed time since the most recent reference location of the word (time-dist), the speaking speed per syllable of the phrase containing the most recent reference of the word, as a ratio to the speaker's average (syllable-speed), and the frequency ratio of the lowest utterance pitch to the highest utterance pitch of the phrase including the reference part closest to the word (pitch-fluct), or any one or more of these, can be used as feature quantities.
•	if the regression analysis is performed including such feature quantities of the voice data, the CPU 11 of the sentence unit search device 1 can also calculate the reference probability from these feature quantities when it receives voice data as word input, as will be described later. As described above, when the kNN smoothing + SVR model is used, the reference probability can be calculated based on more detailed feature quantities, and a more precise reference probability can be obtained.
•	whether or not the word w actually appears or is referenced in the sentence si+1 following the sentence si is recorded for each sample, and regression analysis with the logistic regression model is performed on all samples (si, w). As a result, a regression equation is obtained that calculates, when the feature quantities dist, gram, and chain are given, the probability Pr(si+1, w) that the word w appears or is referenced in si+1.
•	the probability given by the logistic regression model for the explanatory variables (feature quantities) x1, x2, ..., xn is generally obtained by the following equation (1):

  P = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + ... + bn*xn)))  (1)

•	in the regression analysis of the reference probability of the word w in the sentence si performed by the present invention, the explained variable is set to 0 for a sample in which the word does not appear and is not referenced in the subsequent sentence si+1, and to 1 for a sample in which it does appear or is referenced; the explanatory variables are the feature quantities dist, gram, and chain; and the extracted samples are learned to estimate the parameters (regression coefficients) b0, b1, b2, and b3 in the following equation (2):

  P = 1 / (1 + exp(-(b0 + b1*dist + b2*gram + b3*chain)))  (2)

•	equation (3), in which the estimated parameters are applied, is the regression equation for obtaining the reference probability:

  Pr(si+1, w) = 1 / (1 + exp(-(b0 + b1*dist(si, w) + b2*gram(si, w) + b3*chain(si, w))))  (3)
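•	A sketch of this estimation and of evaluating equation (3) is shown below (the toy samples and the use of scikit-learn's LogisticRegression are assumptions for illustration, not part of the embodiment):

```python
import math
from sklearn.linear_model import LogisticRegression

# Toy samples: feature pattern (dist, gram, chain) and whether the word
# actually appeared or was referenced in the following sentence (1) or not (0).
X = [[1, 2, 3], [4, 0, 1], [1, 1, 2], [6, 0, 0], [2, 2, 4], [5, 0, 1]]
y = [1, 0, 1, 0, 1, 0]

model = LogisticRegression().fit(X, y)  # estimates b0..b3 (cf. step S27)
b0, (b1, b2, b3) = model.intercept_[0], model.coef_[0]

def reference_probability(dist: float, gram: float, chain: float) -> float:
    """Equation (3): Pr = 1 / (1 + exp(-(b0 + b1*dist + b2*gram + b3*chain)))."""
    z = b0 + b1 * dist + b2 * gram + b3 * chain
    return 1.0 / (1.0 + math.exp(-z))

print(reference_probability(dist=1, gram=2, chain=3))
```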
•	the estimated parameters differ depending on whether the document data stored in the document storage means 2 consists only of newspaper articles, which are written language, or only of utterances, which are spoken language, converted into document data. The estimated parameter values b0, b1, b2, and b3 also differ depending on the amount and the content of the document data.
•	therefore, the document data is stored separately for written language and spoken language, parameters are estimated by regression analysis for the spoken-language document data as well, and the regression equations for calculating the reference probability are stored for each. If the words accepted by the accepting devices 4, 4, ... are limited to written text entered by text input rather than speech, the document storage means 2 may store the document data without distinguishing spoken language from written language.
•	in this way, by specifying the feature pattern consisting of the feature quantities dist, gram, and chain of each word in a sentence unit, the CPU 11 of the sentence unit search device 1 can calculate the reference probability of the word having that feature pattern. Therefore, the CPU 11 of the sentence unit search device 1 acquires the tagged document data stored in the document storage means 2, sorts it by sentence, specifies a feature pattern for each word that appears or is referenced in each sentence, and calculates its reference probability. As a result, it is possible to quantitatively represent a group of meanings for each sentence that reflects the contextual meaning of the preceding sentences.
•	the CPU 11 of the sentence unit search device 1 acquires the document data stored in the document storage means 2 and, for each sentence included in the document data, identifies the grammatical feature pattern of each word over that sentence and the preceding sentences, calculates the reference probability of each word for each sentence based on the identified feature patterns and the regression equation, and stores the results in advance.
•	the CPU 11 of the sentence unit search apparatus 1 stores the set of each word and its reference probability (the weighted word group) in association with each sentence unit. That is, the CPU 11 performs this storing process for all the sentences of all the documents acquired from the document set. In the later search process, on the other hand, the CPU 11 extracts, from all the sentences of all the documents, the sentences whose contextual meaning is similar to the accepted words. Reading out all the sentences of all the documents one by one, together with the weighted word group representing the contextual meaning associated with each, would therefore impose a heavy processing load.
•	therefore, so that the CPU 11 of the sentence unit search apparatus 1 does not have to read out the weighted word group representing the contextual meaning of each sentence one by one in the subsequent processing, the weighted word group calculated for each sentence is stored in a database and indexed.
•	FIG. 9 and FIG. 10 are flowcharts showing the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 in Embodiment 1 calculates the reference probability of each word for each sentence of the tagged document data stored in the document storage means 2 and stores the results.
•	the process shown in the flowcharts of FIGS. 9 and 10 corresponds to the process of calculating, for each sentence unit, the reference probability of each word using the feature pattern identified for the word and the regression coefficients corresponding to the feature pattern, and storing the calculated reference probabilities in pairs with the words.
•	the CPU 11 of the sentence unit search device 1 acquires the tagged document data from the document storage means 2 via the document set connection means 16 (step S301).
•	the CPU 11 identifies the <su> tags added to the acquired document data by character string analysis and sorts it into sentences (step S302).
•	the CPU 11 identifies each tag within the <su> tags indicating the sentence by character string analysis and extracts the words that appear or are referenced in the sentence (step S303); while the reference probabilities are being calculated for the document data, the extracted words are stored in the temporary storage area 14 (step S304).
•	for the words of the document data containing the sentence that are stored in the temporary storage area 14, the CPU 11 identifies the tags added to each word by character string analysis, and identifies the feature pattern consisting of dist, gram, and chain (step S305). Next, the CPU 11 calculates the reference probability by substituting each feature quantity of the identified feature pattern into equation (3) (step S306).
•	the CPU 11 determines whether or not the reference probability of each word for the sentence has been calculated for all the words stored in the temporary storage area 14 (step S307). If the CPU 11 determines that the reference probabilities have not been calculated for all the words (S307: NO), the CPU 11 returns the process to step S305 and continues identifying feature patterns and calculating reference probabilities for the other words. On the other hand, if the CPU 11 determines that the reference probabilities have been calculated for all the words (S307: YES), the CPU 11 stores the set of the words stored in the temporary storage area 14 and the reference probability calculated for each word (the weighted word group) with the salience attribute added (step S308). At this time, the CPU 11 narrows the words down by a predetermined value and does not store words having a reference probability less than the predetermined value.
•	next, the CPU 11 performs indexing so that the set of words and per-word reference probabilities (the weighted word group) attached to the current sentence can be extracted later, and stores it in the weighted word group database (step S309).
  • the CPU 11 may store the database in the storage unit 13 or may store it in the document storage unit 2 via the document set connection unit 16.
  • the CPU 11 executes the following process as one of the indexing processes.
  • the CPU 11 pays attention to the reference probability of one word in the weighted word group obtained in step S308, and determines whether or not the reference probability of the one word is greater than or equal to a predetermined value. Next, the CPU 11 determines whether or not the reference probability of another word in the weighted word group is a predetermined value or more.
•	that is, the CPU 11 first divides the weighted word groups into a group in which the reference probability of the one word is greater than or equal to the predetermined value and a group in which it is less than the predetermined value, and determines to which group the calculated weighted word group belongs; if it belongs to the group in which the reference probability of the one word is greater than or equal to the predetermined value, the CPU 11 then determines whether it belongs to the subgroup in which the reference probability of another word is greater than or equal to a predetermined value or to the subgroup in which that reference probability is less than the predetermined value. By repeating such processing, the CPU 11 determines to which group the calculated weighted word group belongs, and stores it in association with the identification information of that group. For example, a k-d tree search algorithm can be applied to this indexing process.
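•	The grouping can be sketched with a k-d tree over vectors built from the weighted word groups (a sketch only; the tiny vocabulary and the use of scipy's KDTree are assumptions, and a real index over the full word list would use a sparse representation):

```python
from scipy.spatial import KDTree

# Word identification number -> dimension index (assumed tiny vocabulary).
vocab = {9714: 0, 9716: 1, 120: 2}

def to_vector(weighted_words: dict[int, float]) -> list[float]:
    v = [0.0] * len(vocab)
    for word_id, prob in weighted_words.items():
        v[vocab[word_id]] = prob
    return v

stored = [
    {9714: 0.238, 9716: 0.159},  # weighted word group of sentence unit A
    {120: 0.4},                  # ... of sentence unit B
    {9714: 0.22, 120: 0.05},     # ... of sentence unit C
]
tree = KDTree([to_vector(g) for g in stored])

query = {9714: 0.25, 9716: 0.1}  # weighted word group of the accepted words
dist, idx = tree.query(to_vector(query), k=2)
print("nearest sentence units:", idx, "distances:", dist)
```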
•	the CPU 11 determines whether or not the process of associating a weighted word group with each sentence has been completed for all sentences in the document data acquired in step S301 (step S310).
•	the CPU 11 determines whether the process has been completed for all sentences in the document data as follows: for example, it determines whether another <su> tag follows the <su></su> pair enclosing the current sentence, and if it is determined that none follows, the current sentence can be determined to be the last. If the CPU 11 determines that the process of associating a weighted word group with each sentence is not completed for all sentences in the document data acquired in step S301 (S310: NO), the CPU 11 returns the process to step S302 and continues the processing for the next sentence.
•	if the CPU 11 determines that the process of associating a weighted word group with each sentence is completed for all sentences in the document data acquired in step S301 (S310: YES), the CPU 11 deletes the words extracted from that document data and stored in the temporary storage area 14 (step S311).
•	the CPU 11 determines whether or not the process of storing the words and their reference probabilities with the salience attribute has been completed for all document data (step S312). If the CPU 11 determines that it has not been completed for all document data (S312: NO), the CPU 11 returns the process to step S301, acquires the next document data, and continues the processing. If the CPU 11 determines that the process has been completed for all document data (S312: YES), the CPU 11 ends the process of calculating the word reference probabilities and storing them in advance.
  • FIG. 11 is an explanatory diagram showing an example in which the CPU 11 of the sentence unit search apparatus 1 according to the first embodiment classifies the document shown in the document data for each sentence.
•	the CPU 11 of the sentence unit search device 1 identifies the <su> tags in the document data stored in the document storage means 2 and sorts it into sentences by the processing of steps S301 and S302.
•	the sorted sentences are, for example, s1 "A festival is a ritual that enshrines spirits, etc." and s2 "Festival, ritual ...".
•	the words extracted from the sentences s1, s2, and s3 by the processing of step S303 by the CPU 11 of the sentence unit retrieval apparatus 1 are the words stored in the word list.
•	by the process of step S305, the CPU 11 of the sentence unit search apparatus 1 specifies, for each word in the sentence s, the feature pattern consisting of the feature quantities dist, gram, and chain. For example, for "Kyushu" (identification number: 9714; see FIG. 6) in the sentence s, the feature pattern is specified as follows.
•	the CPU 11 of the sentence unit search device 1 calculates the reference probability by substituting the values of the feature quantities dist, gram, and chain into equation (3) by the process of step S306 in the flowcharts of FIGS. 9 and 10. As shown in equation (4), the reference probability of "Kyushu" in the sentence s is calculated as 0.238, and this reference probability is stored for the sentence s.
  • the word is represented by an identification number stored in a list, and the reference probability is stored in association with it.
•	specifically, the attribute name salience is defined for the <su> tag that delimits each sentence unit, and its attribute value, a list of word identification numbers and reference probabilities, stores the words and their reference probabilities (the weighted word group).
  • FIG. 12 is an explanatory diagram showing an example of document data that the CPU 11 of the sentence unit search device 1 according to the first embodiment gives the result of calculating the reference probability and stores the result in the document storage unit 2.
•	in FIG. 12, the reference probability (weight value) of "Kyushu" (9714) in the sentence s is stored as 0.238, the reference probability of "North Kyushu" (9716) as 0.1159, and so on.
•	FIG. 13 is an explanatory diagram showing an example of the contents of the database when the CPU 11 of the sentence unit search device 1 according to Embodiment 1 indexes and stores the weighted word groups calculated for each sentence unit.
•	the content example in FIG. 13 corresponds to the weighted word group associated with the sentence s shown in the content example of FIG. 12.
•	the CPU 11 stores the weighted word group in association with information (a k-d tree node ID) indicating to which group it belongs. Further, at that time, the CPU 11 records the file name of the tagged document data and the position within the document data so that it can be identified with which sentence unit of which document data the weighted word group is associated.
•	this makes it easy to extract the sentence units associated with weighted word groups similar to the weighted word group obtained for the words received in the later processing.
•	FIG. 14 shows how the set of words stored for each sentence by the CPU 11 of the sentence unit search apparatus 1 and the reference probabilities calculated for those words change as the sentences continue. In FIG. 14, the context continues in time series as sentences s1, s2, s3, and s4 follow one another.
•	the search process starts from the reception of words, such as keywords or speech, input via the receiving devices 4, 4, ....
•	the CPU 41 of the accepting device 4 detects a character string input by the user via the operation means 45 and stores it in the temporary storage area 44, or detects a voice input by the user via the voice input/output means 47, converts it into a character string, and stores it in the temporary storage area 44.
•	the CPU 41 of the accepting device 4 has a function of analyzing a character string input by the user and separating it sentence by sentence. For example, a predetermined character, such as the full stop "。" in Japanese or the period "." in English, may be identified for the separation. Alternatively, each time a press of the Enter key is detected via the operation means 45, the character string entered up to that point may be separated as one sentence. Voice may also be converted into a character string by the voice recognition function and separated into sentences by analysis of the converted character string.
•	the CPU 41 of the accepting device 4 transmits the separated sentences as text data to the sentence unit retrieval device 1 via the communication means 48.
•	next, the processing performed when the CPU 11 of the sentence unit search device 1 receives text data indicating the words accepted by the accepting devices 4, 4, ... and searches for sentences in the documents stored in the document storage means 2 will be described.
•	for the text data indicating the accepted words, quantification of the group of meanings is performed, that is, word extraction from the text data and calculation of the word reference probabilities. In this way, information indicating a group of meanings reflecting the context, corresponding to the flow from the preceding words in the user's latent consciousness when inputting the words, can be created automatically and used as a search request in the search processing described later.
•	the temporary storage area 14 stores the text data in the order received, and morphological analysis and syntactic analysis are performed on the sentences indicated by the received text data.
•	the CPU 11 of the sentence unit retrieval apparatus 1 identifies the feature pattern f(s, w) of each word w in the sentence s of the received text data, and calculates the reference probability based on the identified feature pattern and the previously obtained regression equation.
•	the CPU 11 of the sentence unit search device 1 calculates a reference probability for each word and compares the set of the words and the reference probability calculated for each word with the weighted word groups already stored in association with the sentence units; in other words, a sentence-by-sentence search is performed by this comparison.
•	the CPU 11 of the sentence unit search device 1 can receive from the reception devices 4, 4, ... not only text data but also speech data of utterances input by the user. In this case, the same processing is performed by specifying the grammatical feature patterns of the words expressed in the voice data, as with text data.
•	for speech data, it is also possible to treat features obtained from the speech itself as features for determining whether or not a word is highly salient. For example, when a word appears or is referenced, the CPU 11 can treat the time difference from the preceding appearance or reference of the word as one feature quantity. Further, the CPU 11 can treat the speech speed and/or the speech frequency at the most recent preceding location where the word appeared or was referenced as other feature quantities.
•	the processing procedure by which the accepting device 4 accepts words input by the user and sends them to the sentence unit retrieval device 1, and by which the CPU 11 of the sentence unit retrieval device 1 searches the document data stored in the document storage means 2 based on the text data received from the accepting device 4, will be described with reference to flowcharts.
  • FIG. 15, FIG. 16, and FIG. 17 are flowcharts showing the processing procedure of the search processing of the sentence unit search device 1 and the reception device 4 in the first embodiment.
•	the CPU 41 of the accepting device 4 determines whether a character string input operation by the user has been detected via the operation means 45, or whether a voice input by the user has been detected via the voice input/output means 47 (step S401). If the CPU 41 determines that no character string input operation or voice input by the user has been detected (S401: NO), the CPU 41 returns the process to step S401 and waits until a character string input operation or voice input by the user is detected. On the other hand, if the CPU 41 of the receiving apparatus 4 determines that a character string input operation or a voice input by the user has been detected (S401: YES), the CPU 41 separates the input words into single sentences from the input character string or from the character string converted from the voice input and stores them in the temporary storage area 44 (step S402), and also transmits the input words to the sentence unit search device 1 via the packet switching network 3 (step S403).
•	the CPU 11 of the sentence unit search device 1 receives the words input by the user from the reception device 4 (step S404). The CPU 11 stores the received words as text data in the temporary storage area 14 in the order of reception (step S405). At this time, a sentence identification number may be added to each piece of text data when it is stored.
•	the CPU 11 performs morphological analysis and syntactic analysis on the stored text data (step S406), and stores the words extracted by the analysis in the temporary storage area 14 (step S407). At this time, the CPU 11 collates each word against the words stored in the list and stores it by the identification number in the list.
•	by step S407, the temporary storage area 14 of the sentence unit search device 1 comes to store the words that have appeared or been referred to at least once in the series of input words (utterances).
•	the word extraction in step S407 need not necessarily be performed; in that case, the feature pattern specification process described later is performed on all the words stored in the list.
•	the CPU 11 identifies a feature pattern based on the text data received and stored in the past and on the results of the morphological analysis and syntactic analysis in step S406 (step S408). The CPU 11 substitutes the feature quantities of the identified feature pattern into the regression equation for calculating the reference probability, obtained in advance by regression analysis on spoken language, and calculates a reference probability for each word (step S409). The CPU 11 determines whether or not the reference probabilities have been calculated for all the words stored in the temporary storage area 14 (step S410). If the CPU 11 determines that the reference probabilities have not been calculated for all the stored words (S410: NO), the process returns to step S408, and the feature pattern specification and reference probability calculation are performed for the other words.
•	when the reference probabilities have been calculated and stored in the temporary storage area 14, the words are narrowed down to those having a reference probability of a predetermined value or more (step S411). This removes words with extremely low reference probabilities and reduces the load of the subsequent calculations on the CPU 11 itself.
•	the CPU 11 performs the following search processing based on the narrowed-down words and their reference probabilities, that is, based on the pairs of words and word reference probabilities that quantitatively represent the group of meanings in the flow following the previously accepted words.
•	the following search processing is an example of a process that compares the weighted word group obtained for the received words with the weighted word groups stored in advance for each sentence, determines whether or not the words and the sentences have similar meanings based on whether the weight value distributions of the multiple words in each weighted word group are similar, and extracts similar sentences.
•	the CPU 11 reads, from the database in the storage means 13 or the document storage means 2, the pairs of words and word reference probabilities stored in association with each sentence (hereinafter, weighted word groups) (step S412).
•	at this time, so that the CPU 11 can narrow down its reading to somewhat similar weighted word groups, the CPU 11 determines to which group the weighted word group associated with the accepted words, obtained by the processing up to step S411, belongs, in the same manner as for the weighted word groups stored in the database. The CPU 11 then reads from the database the weighted word groups of the group to which the weighted word group associated with the received words belongs. As a result, comparison with weighted word groups that are not at all similar can be avoided, and somewhat similar weighted word groups can be narrowed down and extracted.
  • the CPU 11 extracts a weighted word group including the same words as the weighted word group of the received word from the weighted word group read out in step S412 (step S413).
•	the CPU 11 calculates, for each word shared with the extracted weighted word group, the difference in the reference probabilities (step S414).
•	the CPU 11 assigns similarities to the extracted weighted word groups, higher for a larger number of identical words and a smaller difference in the reference probabilities of those words (step S415), and reads out the sentences associated with the extracted weighted word groups from the document data of the document set (step S416).
•	at this time, the CPU 11 may read out only the sentences corresponding to weighted word groups having a similarity equal to or greater than a predetermined value.
  • the CPU 11 sorts the extracted sentences by similarity (step S417).
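•	Steps S413 to S417 can be sketched as follows (the similarity score is an assumption chosen to rank a larger number of identical words and smaller reference probability differences higher, as described above):

```python
def similarity(query: dict[int, float], stored: dict[int, float]) -> float:
    """More shared words and smaller probability differences -> higher score."""
    shared = set(query) & set(stored)
    if not shared:
        return 0.0
    diff = sum(abs(query[w] - stored[w]) for w in shared)
    return len(shared) / (1.0 + diff)

query = {9714: 0.25, 9716: 0.10}          # weighted word group of the words
candidates = {                            # weighted word groups per sentence
    "sentence A": {9714: 0.238, 9716: 0.159},
    "sentence B": {9714: 0.02},
}
ranked = sorted(candidates.items(),
                key=lambda kv: similarity(query, kv[1]), reverse=True)
for name, group in ranked:
    print(name, round(similarity(query, group), 3))
```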
  • the CPU 11 transmits text data representing each sentence as text data of the search result to the accepting device 4 via the communication means 15 (step S418).
•	the CPU 41 of the accepting device 4 receives the text data of the search results via the communication means 48 (step S419), displays the received text data on a monitor or the like via the display means 46 (step S420), and ends the process.
•	in this way, the CPU 41 of the accepting device 4 transmits text data or speech data separated into single sentences to the sentence unit searching device 1 each time an input of words from the user is detected.
•	each time it receives text data, or voice data and the information transmitted together with the voice data, from the reception device 4, the CPU 11 of the sentence unit search device 1 extracts the words and calculates a reference probability for each word, and creates, for the words received from the user, information representing a group of meanings reflecting the flow from the preceding words, that is, a weighted word group, as a search request.
  • the CPU 11 of the sentence unit search device 1 extracts sentence units from the stored document data based on the search request (weighted word group) created for the accepted words, and sends the text data as the search results.
•	the CPU 41 of the accepting device 4 in the first embodiment displays the text data of the search results on the monitor or the like each time they are received. Therefore, every time words uttered by the user are input, the reception device 4 displays, as search results, text data whose meaning is similar to those words.
•	the receiving device 4 does not necessarily have to be configured to transmit text data each time words uttered by the user are input and to receive and display the search results. For example, a configuration may be used in which text data or voice data corresponding to a plurality of words input during a predetermined period is transmitted to the sentence unit search device 1, and the search results corresponding to the plurality of words are received and displayed.
•	FIG. 18 is an explanatory diagram showing an example of the feature patterns identified by the CPU 11 of the sentence unit searching device 1 according to the first embodiment for text data received from the receiving device 4. The sentence units s1, s2, and s3 in FIG. 18 are indicated by the received text data.
•	regression analysis has been performed in advance on the document data stored in the document storage means 2, so a regression equation is available with which, once a feature pattern is specified, the reference probability can be calculated by substituting the feature quantities. Therefore, the CPU 11 of the sentence unit search device 1 can calculate the reference probability for "Snoopy" in the sentence s based on the feature quantities dist, gram, and chain of the identified feature pattern. Further, the CPU 11 of the sentence unit search device 1 calculates the reference probabilities for the sentence s including the words that appeared or were referred to in the past, and obtains the words and their reference probabilities.
•	based on the obtained words and reference probabilities, the CPU 11 of the sentence unit search device 1 directly extracts, from the sentence units stored in advance in the document storage means 2 with the salience attribute, the sentence units in which the reference probabilities of the same words are equal to or greater than given values. The CPU 11 of the sentence unit search device 1 transmits text data indicating the extracted sentences to the accepting device 4 via the communication means 15.
•	in this way, the group of meanings of the words represented by the received text data can be expressed by the words and the reference probability (weight value) of each word. Since the words representing a group of meanings and their reference probabilities are stored for each sentence, sentences whose meanings are similar can be searched directly, based on whether or not the extracted words have similar reference probabilities.
•	in Embodiment 2, the pair of the extracted words and the reference probability calculated for each word (the weighted word group) is treated as a salience vector. Furthermore, the pair of the words obtained for the accepted words and the reference probability calculated for each word (the weighted word group) is also treated as a salience vector. Then, at the stage of the search process, instead of comparing the weight value distributions of the multiple words in the weighted word group of the accepted words and in the weighted word groups previously associated with each sentence as shown in the first embodiment, each weighted word group is represented by a salience vector, and whether or not the similarity condition is satisfied is determined by the shortness of the distance between the salience vectors.
•	as in the first embodiment, the information that quantitatively represents a group of meanings for each sentence is expressed as the group of words that the user is paying attention to when using the sentence (speaking, writing, listening, or reading), together with a value (word weight value) that quantitatively indicates the degree to which the user pays attention to each word, that is, its salience.
•	as the quantitative weight value of salience, the reference probability, which indicates the probability that the word will appear or be referenced in subsequent sentences, is used.
•	the reference probability is calculated using the regression equation containing the regression coefficients obtained by the regression analysis on the samples of the document data stored in the document storage means 2, as in "3-1. Regression model learning" of the first embodiment.
•	the CPU 11 of the sentence unit search apparatus 1 specifies the feature quantities dist, gram, and chain for each extracted word and, using the regression equation containing the regression coefficients obtained by the regression analysis, can calculate the reference probability of each word. A weighted word group is obtained by assigning the reference probability of each word as that word's weight value.
•	in the second embodiment, the weighted word group that represents a group of meanings for each sentence is treated as a vector in which each word is one dimension and the reference probability calculated for each word is the element of the dimension component corresponding to that word. That is, the group of meanings of a sentence in the document data stored in the document storage means 2 can be represented by a vector in the multidimensional space whose dimensions are the words extracted from the document data stored in the document storage means 2 and stored in the list shown in FIG. 6.
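•	As a sketch, such a salience vector can be held sparsely, with the word identification number as the dimension and the reference probability as the element (illustrative only; the full space over the list of FIG. 6 would have 31245 dimensions):

```python
# Sparse salience vector: word identification number -> reference probability.
salience_vector = {9714: 0.238, 9716: 0.159}

def dense(vec: dict[int, float], dims: list[int]) -> list[float]:
    """Expand a sparse salience vector over the given dimensions (word ids)."""
    return [vec.get(d, 0.0) for d in dims]

print(dense(salience_vector, [9714, 9716, 120]))  # [0.238, 0.159, 0.0]
```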
•	the document data that the CPU 11 of the sentence unit search apparatus 1 stores in the document storage means 2 in the second embodiment, with the results of calculating the reference probabilities added, is the same as the document data shown in the explanatory diagram of FIG. 11 of the first embodiment. That is, the document data stored in the document storage means 2 stores the dimension numbers and the reference probability values that are the elements of the corresponding dimension components.
•	the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 according to the second embodiment calculates the word reference probabilities for each sentence of the tagged document data stored in the document storage means 2 and stores them in the database in association with each sentence is the same as in the first embodiment, so its explanation is omitted.
Next, the processing by which the CPU 11 of the sentence unit search apparatus 1, upon receiving text data indicating words accepted by the accepting device 4, searches for sentence units in the documents stored in the document storage unit 2 will be described. The CPU 11 likewise represents the group of contextual meanings of the accepted words, for the text data indicating them, as a manifestation vector indicating a direction in the multidimensional word space.

Concretely, for the text data received from the accepting device 4, the CPU 11 of the sentence unit search device 1 specifies, for each of the 31,245 words stored in the list, the feature pattern represented by the feature quantities dist, gram, and chain. For a word that has not appeared in the series of text data received so far, the specification of the feature pattern is omitted and the corresponding dimension component element is set to 0. The reference probabilities that are the elements of the dimension components can then be calculated based on the regression equation. Therefore, each time text data is received, the CPU 11 of the sentence unit search device 1 can calculate a manifestation vector representing the cohesion of meaning in the context of the words indicated by the received text data.

The CPU 11 of the sentence unit search device 1 directly calculates the distance between the manifestation vector calculated for the accepted words and the manifestation vectors, stored in the document storage means 2, of the sentence units to which the salience attribute was added in advance, and extracts the sentence units whose distance is short. Sentence units whose meanings have a similar direction can thus be searched for in the 31,245-dimensional space in which each word of FIG. 6 is one dimension. The CPU 11 of the sentence unit search device 1 transmits text data indicating the extracted sentence units to the accepting device 4 via the communication means 15. If a computer capable of handling vector operations is used, the cohesion of meaning of each sentence unit can be computed on directly as a manifestation vector.
FIG. 19 is a flowchart showing the processing procedure of the search processing of the sentence unit search device 1 and the accepting device 4 in the second embodiment. The same reference numerals are used for the steps that are the same as in the processing procedures of the search processing shown in the flowcharts of FIGS. 15, 16, and 17 in the first embodiment, and their detailed description is omitted.
First, the CPU 11 of the sentence unit search device 1 calculates the reference probabilities and narrows down all the words stored in the temporary storage area 14 to those for which a reference probability equal to or greater than a predetermined value has been calculated (step S411). The manifestation vector of the accepted words is then calculated based on each narrowed-down word and its calculated reference probability (step S501). In this way, a manifestation vector that quantitatively represents the cohesion of meaning in the flow from the previously accepted words can be generated as the search request for the accepted words. The following processing is an example of processing that compares the manifestation vector obtained for the accepted words with the manifestation vectors of the sentence units stored in advance and determines whether the distributions of the weight values of the words represented by the manifestation vectors are similar.
The CPU 11 reads the weighted word groups stored in the database, that is, the manifestation vectors (step S502). At this time, for the manifestation vector associated with the accepted words obtained by the processing up to step S411, the CPU 11 determines, in the same manner as described above, to which group among the manifestation vectors stored in the database it belongs. The CPU 11 then reads from the database the manifestation vectors of the group to which the manifestation vector associated with the accepted words belongs. As a result, it is possible to narrow down and extract the manifestation vectors having a similar distribution of weight values over the words.
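The grouping scheme itself is the one carried over from the first embodiment and is not detailed here, but the narrowing idea can be sketched as follows, with a deliberately simple grouping key assumed purely for illustration (the word with the largest weight):

```python
def group_key(vec: dict) -> str:
    """Assumed grouping rule for illustration: the dominant word."""
    return max(vec, key=vec.get) if vec else ""

def narrow_candidates(query_vec: dict, stored: list[dict]) -> list[dict]:
    """Read only the manifestation vectors in the query's group (step S502),
    so the distance calculation of step S503 runs over fewer candidates."""
    key = group_key(query_vec)
    return [v for v in stored if group_key(v) == key]
```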
The CPU 11 calculates the distance between the manifestation vector associated with the accepted words and each read manifestation vector (step S503). The CPU 11 narrows down the read manifestation vectors to those whose calculated distance is less than a predetermined value (step S504), and reads the sentence units stored in association with the narrowed-down manifestation vectors (step S505). The CPU 11 assigns similarities to the read sentence units in order of increasing calculated distance (step S506). By the processing from step S501 to step S506 by the CPU 11 of the sentence unit search apparatus 1 in the second embodiment, sentence units whose contextual meaning is similar to the accepted words are extracted. The processing of step S417 for the extracted sentence units is the same as in the first embodiment.
The calculation in step S503 of the distance between the manifestation vector associated with the accepted words and each read manifestation vector in the above-described processing procedure is concretely performed as follows. Representing the manifestation vector associated with the accepted words u as v(u) and the read manifestation vector as v(s), the CPU 11 calculates the cosine as shown in the following equation (5):

  cos(v(u), v(s)) = (v(u) · v(s)) / (|v(u)| |v(s)|)   (5)

In step S506, the CPU 11 assigns similarities in descending order of the calculated cosine value.
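A minimal sketch of this distance computation follows, with the manifestation vectors held as sparse word-to-weight mappings rather than dense 31,245-dimensional arrays; the sparse representation is an implementation choice, not something the embodiment prescribes:

```python
import math

def cosine(v_u: dict, v_s: dict) -> float:
    """Cosine between two manifestation vectors, as in equation (5).
    Missing words are implicitly zero, so only shared words contribute
    to the dot product."""
    dot = sum(w * v_s.get(word, 0.0) for word, w in v_u.items())
    norm_u = math.sqrt(sum(w * w for w in v_u.values()))
    norm_s = math.sqrt(sum(w * w for w in v_s.values()))
    if norm_u == 0.0 or norm_s == 0.0:
        return 0.0
    return dot / (norm_u * norm_s)

# Sentence units are then ranked in descending order of this value (step S506).
v_u = {"America Village": 0.6, "Osaka": 0.3}
v_s = {"Osaka": 0.5, "autumn": 0.1}
print(cosine(v_u, v_s))
```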
In the third embodiment, the weight value representing the manifestation of each word is recalculated taking into account the associations from other words that are deeply related to it. Here, an association refers to the following: even when a word in the weighted word group associated with a sentence unit does not appear in that sentence unit or in the preceding sentence units, if another word deeply related to it is highly manifest there, the word itself is also attracting attention in that sentence unit. Accordingly, a word that readily attracts attention at the same time as a given word is attracting attention is treated as a related word, and the influence of the manifestation of closely related words is reflected in the weight value representing the manifestation of each word.
FIG. 20 is an explanatory diagram showing an overview of how the manifestation of one word influences that of closely related words in the search method of the present invention in the third embodiment. The explanatory diagram of FIG. 20 represents an example of a conversation between users, the conversation being a series of utterances u1, u2, u3, .... For a word that has not itself been referred to in the recent utterances, the weight value calculated from the reference probability alone may have dropped; however, when a word closely related to it is highly manifest in the conversation, the weight value of the word should have a high value.
Therefore, in the third embodiment, the weight value representing the manifestation of each word associated with each sentence unit or with the accepted words is recalculated in consideration of the manifestation of its related words. In order to recalculate the reference probabilities into weight values that take the manifestation of related words into account, the sentence unit search device 1 first needs to hold information representing how deeply the words are related to one another. The influence of the degree of association, which represents the depth of the relation, is then reflected in the reference probability of each word calculated for each sentence unit. Specifically, in terms of the above example, the degree of association of "America Village" with "Osaka" is first calculated quantitatively. Next, the weight value representing the manifestation of "Osaka" in each sentence unit is recalculated and stored by reflecting the influence of that degree of association on the reference probability of "America Village".
To this end, the sentence unit search device 1 creates, for each word, a weighted related word group in which the degree of association of every other word to that word is given as a weight value. The weighted related word groups are created from the weighted word groups stored in association with each sentence unit by the processing of "3-3. Quantification of manifestation per sentence unit", that is, from the combinations of words and their reference probabilities, or the manifestation vectors. The sentence unit search device 1 creates and stores a weighted related word group for every word extracted from the entire document set. Then, for the weighted word group stored in association with each sentence unit, that is, the combination of each word and its reference probability or the manifestation vector, the influence of the reference probabilities of the words closely related to each word is reflected in that word's reference probability using the degrees of association, and the weight value of each word is recalculated and stored. In the search processing, the sentence unit search apparatus 1 similarly recalculates, using the degrees of association, the weight value of each word in the weighted word group associated with the accepted words, that is, the combination of words and reference probabilities or the manifestation vector. The sentence unit search device 1 then performs the search processing based on the words corresponding to the accepted words and the weight values recalculated for each word.
The related word groups are created by the sentence unit search device 1 performing the following processing for every word extracted in the explanatory diagram shown in FIG. 6. First, from the weighted word groups stored in association with every sentence unit in "3-3. Quantification of manifestation per sentence unit", the sentence unit search device 1 extracts the weighted word groups in which the reference probability of the target word is equal to or greater than a predetermined value. This is because, as described above, a related word is a word that is likely to be attended to at the same time as the target word, so the sentence units in which the target word is attracting attention are singled out. Next, the sentence unit search device 1 integrates the extracted weighted word groups, in which the reference probability of the target word is equal to or greater than the predetermined value. Specifically, the reference probability of each word in each weighted word group is weighted by the reference probability of the target word in that weighted word group, and the reference probabilities of each word are averaged. The reason for weighting by the reference probability of the target word is to give greater influence to the reference probabilities of the words in weighted word groups in which the target word has a higher reference probability.
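Stated as code, the integration could be sketched as follows: collect the weighted word groups in which the target word's reference probability reaches the threshold, weight each group by that probability, sum per word, and normalize. This is a minimal sketch of the procedure just described; the threshold value 0.2 and the L2 normalization match the example given later in this embodiment.

```python
import math

def related_word_group(target: str, groups: list[dict], threshold: float = 0.2) -> dict:
    """Build the weighted related word group for `target` from the weighted
    word groups stored for every sentence unit."""
    # Keep only sentence units in which the target word is salient (step S605).
    selected = [g for g in groups if g.get(target, 0.0) >= threshold]
    # Sum each word's reference probability, weighted by the target word's
    # reference probability in that group (step S609).
    summed: dict = {}
    for g in selected:
        w_target = g[target]
        for word, p in g.items():
            summed[word] = summed.get(word, 0.0) + w_target * p
    # Normalize so the squares of the weights sum to 1 (step S610).
    norm = math.sqrt(sum(v * v for v in summed.values()))
    return {word: v / norm for word, v in summed.items()} if norm else summed

gw1 = {"America Village": 0.6, "Osaka": 0.4, "autumn": 0.0}
gw2 = {"America Village": 0.3, "Osaka": 0.5, "shopping": 0.2}
print(related_word_group("America Village", [gw1, gw2]))
```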
FIG. 21 and FIG. 22 are flowcharts showing the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 according to the third embodiment creates the related word groups. The processing shown in the flowcharts of FIG. 21 and FIG. 22 corresponds to the process of extracting the weighted word groups in which the weight value of one word is equal to or greater than a predetermined value, the process of integrating the weight values of each word of the extracted word groups as degrees of association, the process of creating the related word group in which a degree of association is assigned to each word and storing it in association with the one word, and the process of executing these steps for every word.
The CPU 11 of the sentence unit search device 1 selects one word from the list stored in the storage means 13 (step S601). Next, the CPU 11 acquires tagged document data from the document storage means 2 via the document set connection means 16 (step S602). The CPU 11 identifies the tags <su> added to the acquired document data by character-string analysis and reads out a sentence unit (step S603). The CPU 11 reads out the salience attribute stored in <su> (step S604), and determines whether or not, in the set of words and word reference probabilities (the weighted word group) stored in the salience attribute, the reference probability of the word selected in step S601 is equal to or greater than the predetermined value (step S605). If the CPU 11 determines that the reference probability is equal to or greater than the predetermined value (S605: YES), the CPU 11 stores the weighted word group read together with the salience attribute in step S604 in the temporary storage area 14 (step S606). The CPU 11 then determines whether or not the processing from step S604 to step S606 has been executed for all the sentence units of the document data acquired in step S602 (step S607). If the CPU 11 determines that the processing has not been executed for all the sentence units (S607: NO), the CPU 11 returns the processing to step S603, reads the next sentence unit (S603), and executes the processing from step S604 to step S606.
When the processing has been executed for all the sentence units (S607: YES), the CPU 11 determines whether or not the weighted word groups in which the reference probability of the selected word is equal to or greater than the predetermined value have been extracted from all the document data (step S608). If the CPU 11 determines that they have not yet been extracted from all the document data (S608: NO), the CPU 11 returns the processing to step S602, acquires the next document data (S602), and executes the processing from step S603 to step S607. If the CPU 11 determines in step S608 that the weighted word groups in which the reference probability of the selected word is equal to or greater than the predetermined value have been extracted from all the document data (S608: YES), the CPU 11 integrates the set of weighted word groups extracted by the processing of step S606 and stored in the temporary storage area 14, by calculating, for each word, the sum of the weight values weighted by the reference probability of the selected word (step S609).
The CPU 11 normalizes the summed weighted word group created in step S609, that is, the weight value of each word of the summed weighted word group (step S610). The CPU 11 stores the weighted word group thus normalized, with each weight value as a degree of association, as the related word group of the word selected in step S601, in the storage means 13 or, via the document set connection means 16, in the document storage means 2 (step S611). The CPU 11 of the sentence unit search device 1 then determines whether or not it has created and stored related word groups for all the words in the list stored in the storage means 13 (step S612). If the CPU 11 determines that it has not yet created and stored related word groups for all the words (S612: NO), the CPU 11 returns the processing to step S601, selects the next word (S601), and executes the processing from step S602 to step S611 for the selected word.
Note that in step S605, rather than simply determining whether the raw reference probability is equal to or greater than the predetermined value, the CPU 11 of the sentence unit search device 1 may compare a normalized value with the predetermined value as follows. The CPU 11 of the sentence unit search device 1 normalizes the reference probabilities of the words associated with a sentence unit by dividing each by the square root of the sum of the squares of all the reference probabilities, so that the sum of the squares of the normalized reference probabilities is "1". Likewise, in step S610, the CPU 11 performs the normalization by dividing each weight value by the square root of the sum of the squares of all the weight values.
Next, an example is given of the related word group created when the CPU 11 of the sentence unit search apparatus 1 performs the processing shown in the flowcharts of FIGS. 21 and 22 for one word. FIG. 23 is an explanatory diagram showing an example of the weighted word groups at each stage of the processing when a related word group is created by the CPU 11 of the sentence unit search apparatus 1 according to the third embodiment. In the example of FIG. 23, the CPU 11 of the sentence unit search device 1 selects the word "America Village" and extracts the weighted word groups in which the reference probability of "America Village" is equal to or greater than the predetermined value (0.2).
FIG. 23(a) shows the weighted word groups GW1, GW2, GW3 extracted by the processing of the CPU 11 in step S605 shown in the flowcharts of FIGS. 21 and 22 and stored in the temporary storage area 14. FIG. 23(b) shows the word groups GW1', GW2', GW3' after each has been weighted, and FIG. 23(c) shows the weighted word group GW'' obtained by weighting and summing them through the processing of the CPU 11 in step S609.
In FIG. 23(a), the weighted word groups GW1, GW2, GW3, in which the weight value (reference probability) of the one word "America Village" is equal to or greater than the predetermined value 0.2, have been extracted. In FIG. 23(b), each weight value is multiplied by the weight value (reference probability) of "America Village" in its own weighted word group. The weight values of each word in the resulting word groups GW1', GW2', GW3' are obtained as follows. For example, since the weight value (reference probability) of "America Village" in the weighted word group GW1 is 0.6, each weight value of GW1 is multiplied by 0.6, giving the word group GW1' (autumn: 0 (0.6 × 0), America Village: 0.36 (0.6 × 0.6), ..., Okumaza: 0). As shown in FIG. 23(b), the weight values of each word, weighted in this way by the weight value (reference probability) of the one word "America Village", are then summed word by word: the weight value of each word in the word group GW'' shown in FIG. 23(c) is the sum over the word groups GW1', GW2', GW3' shown in FIG. 23(b). Finally, the CPU 11 of the sentence unit search device 1 squares the weight value of each word, calculates the square root of the sum of the squared values, divides the weight value of each word by that square root, and thereby normalizes the weight values of each word in the weighted word group GW''.
The weighted word group GW'' integrated by weighting and summing is a multidimensional vector in which each word is one dimension and the weight value of each word is the element of the corresponding dimension. The multidimensional vector may be normalized by dividing each weight value (element) by the norm of the multidimensional vector; the norm need not necessarily be the Euclidean norm. The weighted word group obtained as the result of summing and normalizing in this way is created by the CPU 11 of the sentence unit search device 1 as the related word group of "America Village".
The example shown below is the related word group of the word "America Village", with the words listed in descending order of weight value. The related word group created for a word w_i, in which each weight value is the degree of association from the word w_i to each of the words w_1, ..., w_N, is denoted bw_i = (w_1: b_{i,1}, w_2: b_{i,2}, ..., w_N: b_{i,N}).
The CPU 11 of the sentence unit search device 1 repeats the above-described processing for all the words shown in the explanatory diagram of FIG. 6 to create a related word group for each word, and stores them in the document storage means 2 or in the storage means 13 of the sentence unit search device 1. By creating, for every word appearing in the document set, a related word group in which the degrees of association are quantitatively calculated in this way, the influence of related words can be reflected in the weighted word groups that are created and stored to represent the cohesion of meaning of each sentence unit. Next, the degrees of association of each word in the created related word groups are reflected in the weighted word group stored for each sentence unit, that is, in the set of words and their reference probabilities or the manifestation vector. Specifically, the sentence unit search device 1 reads the reference probability of each word that has already been calculated and stored, and recalculates and stores, as the new weight value of each word, the sum of the reference probabilities of the words multiplied by their degrees of association to that word.
FIG. 24 is a flowchart showing the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 in Embodiment 3 recalculates the weight value of each word in the weighted word group stored in association with each sentence unit. The processing shown in the flowchart of FIG. 24 corresponds to the process of reassigning, using the degrees of association, the weight value of each word of the weighted word group associated with each sentence unit.

The CPU 11 of the sentence unit search device 1 acquires tagged document data from the document storage unit 2 via the document set connection unit 16 (step S71). The CPU 11 identifies the tags <su> added to the acquired document data by character-string analysis and reads out a sentence unit (step S72). The CPU 11 reads the salience attribute stored in <su> (step S73), and recalculates each of the reference probabilities in the word and word-reference-probability pairs (the weighted word group) stored in the salience attribute into weight values that take the associations into account, using the related word groups (step S74). The CPU 11 re-stores the weighted word group (manifestation vector), consisting of each word and the weight values recalculated in step S74, with the salience attribute added (step S75).

The CPU 11 determines whether or not the sentence unit read in step S72 is at the end of the document data (step S76). Whether the current sentence unit is at the end of the acquired document data can be determined by whether or not another <su> tag follows the <su></su> pair enclosing the current sentence unit; if none follows, it can be determined to be the end. If the CPU 11 determines that it is not the end of the document data (S76: NO), the CPU 11 returns the processing to step S72 and continues the processing for the next sentence unit. On the other hand, if the CPU 11 determines that it is the end of the document data (S76: YES), the CPU 11 determines whether or not the processing of recalculating the weight values of each word in the weighted word groups and storing them in association with the salience attributes has been completed for all the document data (step S77).
The CPU 11 of the sentence unit search apparatus 1 realizes the recalculation of the weight value of each word in step S74 by performing the following processing. FIG. 25 is a flowchart showing the details of the processing procedure by which the CPU 11 of the sentence unit search device 1 in Embodiment 3 recalculates the weight value of each word in the weighted word group stored in association with each sentence unit. The processing shown in the flowchart of FIG. 25 corresponds to the process of multiplying the weight values of the weighted word group by the degrees of association of each word, and the process of reassigning the weight value of each word based on the multiplied weight values.
The CPU 11 of the sentence unit search device 1 reads each word of the weighted word group stored in association with the salience attribute read in step S74 of the flowchart of FIG. 24, together with the reference probability of each word, and stores them in the temporary storage area 14 (step S81). The CPU 11 selects one of the words (step S82), and performs the following processing for the weight value of the selected word. The CPU 11 reads the related word groups, in which the degree of association of each word is given, stored in the storage means 13 or the document storage means 2 (step S83). The CPU 11 acquires, from the related word group of each read word, the degree of association from each word to the selected word (step S84). The CPU 11 multiplies the acquired degree of association from each word to the selected word by the reference probability of that word stored in the temporary storage area 14, and calculates the sum (step S85). The CPU 11 then determines whether or not the weight value has been recalculated for all the words stored in the temporary storage area 14 in step S81 (step S86). If the CPU 11 determines that the weight value has not been recalculated for all the words (S86: NO), the CPU 11 returns the processing to step S82, moves to the next word, and executes the processing of recalculating the weight value from step S82 to step S85. If the CPU 11 determines that the weight value has been recalculated for every word (S86: YES), the CPU 11 returns the processing to step S75 of the flowchart of FIG. 24.
The processing by which the CPU 11 of the sentence unit search device 1 recalculates the weight values, shown in the flowchart of FIG. 25 as step S74 of the flowchart of FIG. 24, may also be executed within the processing of the first embodiment that calculates the reference probabilities and stores them as the weight values representing the manifestation in each sentence unit. For example, the configuration may be such that the processing of step S74 shown in the flowchart of FIG. 25 is executed between the processing of step S306 and step S307 of the processing procedure of the first embodiment.
By the above processing, the CPU 11 of the sentence unit search device 1 recalculates the reference probability calculated for each word into a weight value that reflects the associations. For example, the sentence unit search device 1 calculates the weight value representing the manifestation of "Osaka" in a sentence unit as follows. Assume that in the related word group created for "America Village" the degree of association to "Osaka" is "0.3", and that the words stored in association with a sentence unit include "America Village" with a reference probability of 0.4 but do not include "Osaka". The CPU 11 of the sentence unit search device 1 multiplies the reference probability 0.4 of "America Village" by the degree of association 0.3 from "America Village" to "Osaka", so that the weight value of "Osaka" in that sentence unit is recalculated to "0.12" instead of "0".
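The recalculation loop of steps S81 to S86 and this worked example can be put together in a short sketch. Assuming, for illustration, that each word's degree of association to itself is 1 (consistent with the unit-length basis vectors described in the following paragraphs), the function reproduces the value 0.12 for "Osaka":

```python
def recalc_weights(weighted_group: dict, related: dict) -> dict:
    """Recalculate weights, reflecting associations (steps S81-S86).
    related[w] maps w to its related word group: the degrees of
    association from w to other words."""
    # Candidate words: those in the group plus those associated with them,
    # so that a word like "Osaka" can gain weight from "America Village"
    # even when its own reference probability is 0.
    candidates = set(weighted_group)
    for word in weighted_group:
        candidates |= set(related.get(word, {}))
    recalculated = {}
    for target in candidates:                       # step S82: select one word
        total = 0.0
        for word, ref_prob in weighted_group.items():
            # degree of association from `word` to `target` (step S84),
            # multiplied by `word`'s reference probability and summed (step S85)
            total += related.get(word, {}).get(target, 0.0) * ref_prob
        recalculated[target] = total
    return recalculated

# Worked example from the text: association "America Village" -> "Osaka" = 0.3,
# reference probability of "America Village" = 0.4, "Osaka" absent.
related = {"America Village": {"America Village": 1.0, "Osaka": 0.3}}
print(recalc_weights({"America Village": 0.4}, related))  # Osaka -> 0.12
```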
When the weight value representing the manifestation, with contextual associations taken into account, of a word w_k in each sentence unit s_j is written salience'(w_k | pre(s_j)), the sentence unit search apparatus 1 recalculates the weight value of each word as shown in the following equation (6):

  salience'(w_k | pre(s_j)) = Σ_{i=1}^{N} b_{i,k} · salience(w_i | pre(s_j))   (6)

where b_{i,k} is the degree of association from the word w_i to the word w_k. Writing the weighted word group, that is, the pairs of words and word reference probabilities, as the manifestation vector v(s_j) whose k-th element is salience(w_k | pre(s_j)), the manifestation vector V(s_j) after the associations are reflected is expressed as in equation (7):

  V(s_j) = B v(s_j)   (7)

where B is the transformation matrix whose elements are the degrees of association b_{i,k}. Equation (7) represents the principle by which the weight value of each word is calculated: each of bw_1, ..., bw_N is the relevance vector of the related word group for the words w_1, ..., w_N, and V(s_j) is the manifestation vector in the oblique coordinate system based on the relevance vectors bw_1, ..., bw_N. The manifestation vector V(s_j) taking the associations into account can thus be interpreted as the manifestation vector v(s_j), whose elements are the raw reference probabilities, rotated toward the axes of the related words. The oblique coordinate system based on the relevance vectors bw_1, ..., bw_N is a coordinate system in which each basis vector (a vector of size 1 in the direction of each word dimension) reflects the associations: the angle between the basis vectors of mutually highly related words is small, while words that are not related to each other remain orthogonal. Multiplying the transformation matrix whose elements are the b_{j,k} by the manifestation vector whose elements are the reference probabilities therefore reflects the associations in the weight value of every word.
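In matrix form, the same recalculation is a single multiplication, as the following short numpy sketch shows; the word order and the association values are illustrative:

```python
import numpy as np

words = ["America Village", "Osaka", "autumn"]
# B[k, i] = degree of association from word w_i to word w_k; the diagonal
# is 1 because each basis vector has size 1 in its own word dimension.
B = np.array([[1.0, 0.2, 0.0],
              [0.3, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
v = np.array([0.4, 0.0, 0.0])    # raw reference probabilities, v(s)
V = B @ v                        # equation (7): V(s) = B v(s)
print(dict(zip(words, V)))       # "Osaka" becomes 0.12 via the association
```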
FIG. 26 is an explanatory diagram showing an example of the contents of the weight values representing the manifestation of each word calculated by the CPU 11 of the sentence unit search apparatus 1 according to the third embodiment. The weight values of each word for the sentence units s1, s2 shown in FIG. 26(a) are the values before the related word groups are applied, and the weight values of each word for the sentence units s1, s2 shown in FIG. 26(b) are the values after the associations have been taken into account using the related word groups. The specific example shown in FIG. 26 uses sentence units extracted from a Japanese spoken-language corpus (http://www.kokken.go.jp/katsudo/kenkyujyo/corpus, CSJ/vol17/D03F0040).
Next, the CPU 11 of the sentence unit search apparatus 1 adds the associations with related words to the combination of words and word reference probabilities, or the manifestation vector, that is, the weighted word group, that quantitatively represents the meaning of the accepted words. Below, the processing is described by which the CPU 11 of the sentence unit search device 1 recalculates, taking the associations into account, the weight value of each word in the weighted word group associated with the accepted words, and executes a search based on the recalculated weight values.
FIG. 27 is a flowchart showing the processing procedure of the search processing of the sentence unit search device 1 and the accepting device 4 in the third embodiment. The same reference numerals are used for the steps that are the same as in the processing procedures of the search processing shown in the flowcharts of FIGS. 15, 16, and 17 in the first embodiment, and their detailed description is omitted. The processing of step S4001, surrounded by the two-dot chain line, differs from the processing procedures shown in the flowcharts of FIGS. 15, 16, and 17 in the first embodiment; that is, the difference is that step S4001 described below is added between step S411 and step S412.
After calculating the reference probabilities, the CPU 11 narrows down all the words stored in the temporary storage area 14 to those whose calculated reference probability is equal to or greater than the predetermined value (step S411), and recalculates the reference probabilities calculated in step S408 into weight values that reflect the associations (step S4001). In step S4001, the CPU 11 recalculates the weight values reflecting the associations in the same way as in the processing shown in the flowchart of FIG. 25: it selects one word at a time and calculates, for the selected word, the sum of the products of the degree of association from each word to the selected word and the reference probability of each word. The CPU 11 then compares the weighted word group with the associations added obtained in step S4001 with the weighted word groups with associations added that are stored in association with each sentence unit, and executes the processing of extracting similar sentence units. Since the subsequent processing for the weighted word groups with the associations added is the same as in the first embodiment, its detailed description is omitted.
As described above, the sentence unit search apparatus 1 can directly output, from among the sentence units separated from the document data stored in the document storage means 2, the sentence units whose cohesion of meaning, with associations taken into account using the related words, is judged to be similar to the accepted words. Therefore, by executing the sentence unit search method of the present invention, sentence units whose contextual meanings are similar, including associations, can be effectively extracted and directly output. When the CPU 11 of the sentence unit search device 1 associates a weighted word group with the accepted words and determines whether it is similar to the weighted word groups stored in advance for each sentence unit, the determination need not always be based, as in the processing procedure shown in the flowchart of FIG. 27, on whether the weighted word groups include the same words and on calculating the differences between the weight values assigned to the same words, with smaller calculated differences meaning greater similarity. The case where the CPU 11 of the sentence unit search apparatus 1 extracts sentence units whose meaning is similar to the accepted words by expressing the meanings as manifestation vectors reflecting the relevance vectors and calculating the distance between them is described below.
FIG. 28 is a flowchart showing the processing procedure of the search processing of the sentence unit search device 1 and the accepting device 4 when the vector representation in the third embodiment is used. In the processing procedure shown in the flowchart of FIG. 28, the same reference numerals are used for the steps that are the same as in the processing procedures of the search processing shown in the flowcharts of FIGS. 15, 16, and 17 in the first embodiment and in the flowchart of FIG. 19 in the second embodiment, and their detailed description is omitted. The processing from step S501 up to step S506, surrounded by the alternate long and short dash line, is also carried out here. The processing of step S5001, surrounded by the two-dot chain line, differs from the processing procedure shown in the flowchart of FIG. 19 in the second embodiment; that is, the difference is that step S5001 described below is added between step S501 and step S502.
The CPU 11 of the sentence unit search device 1 recalculates the manifestation vector calculated in step S501 into a manifestation vector reflecting the associations of the related words (step S5001). The CPU 11 then compares the manifestation vector with the associations obtained in step S5001 with the manifestation vectors with associations that are stored in association with each sentence unit, and executes the processing of extracting similar sentence units. Since the processing of reading the manifestation vectors with the associations added and extracting similar sentence units is the same as in the second embodiment, its detailed description is omitted. The processing of step S5001, in which the CPU 11 recalculates the manifestation vector into one taking the associations with the related words into account, transforms (rotates) the manifestation vector calculated in step S501 using the group of relevance vectors (the matrix), as shown in equation (7). Specifically, the manifestation vector V(u) is calculated by applying the association transformation to the multidimensional vector v(u) whose elements are only the reference probabilities.
In the third embodiment, the calculation in step S503 of the distance between the manifestation vector associated with the accepted words and each read manifestation vector is concretely performed as follows. Representing the manifestation vector recalculated with the associations for the accepted words u as V(u), and the read manifestation vector with the associations added in advance as V(s), the CPU 11 calculates the cosine as shown in the following equation (8):

  cos(V(u), V(s)) = (V(u) · V(s)) / (|V(u)| |V(s)|)   (8)

In step S506, the CPU 11 assigns similarities in descending order of the calculated cosine value.
The manifestation vectors associated with each sentence unit and with the accepted words are handled in an oblique coordinate system in which the dimensions corresponding to the words are not orthogonal and the angle between the dimension directions of words with a high degree of association is small. For this reason, when the distances between vectors are compared to determine whether they are similar, vectors that have elements in the dimension directions of highly associated words are also judged to be similar. For example, when the manifestation of "Osaka" in the accepted words is low, a sentence unit s concerning "Osaka" is not judged to be similar to the accepted words. However, when the manifestation of "America Village" in the accepted words is high, the manifestation of "Osaka" is excited and increased, so the possibility that the sentence unit s is judged to be similar to the accepted words increases.
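As an illustration with hypothetical numbers, the following compares the cosine before and after the association transform for this example; the symmetric association value and the weights are assumptions for demonstration. The effect is that a query mentioning only "America Village" starts to match a sentence unit about "Osaka":

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

B = np.array([[1.0, 0.3],     # dims: [America Village, Osaka]
              [0.3, 1.0]])    # symmetric association assumed for illustration
v_u = np.array([0.8, 0.0])    # accepted words: only "America Village" manifest
v_s = np.array([0.0, 0.7])    # sentence unit s: only "Osaka" manifest
print(cos(v_u, v_s))          # 0.0 - no overlap without associations
print(cos(B @ v_u, B @ v_s))  # > 0 - equation (8) on the rotated vectors
```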
In Embodiments 1 to 3, the text data received as the search result is displayed on the monitor of the display means 46 provided in the accepting device 4; however, a configuration may also be adopted in which the received text data is converted into a speech signal and output via the speaker of the audio input/output means 47. In this way, the user can obtain, as search results, sentence units whose contextual meaning is similar, either from the words he or she inputs or from a conversation with another user. Moreover, since the accepted words may be spoken language, sentence units whose word manifestations are similar can be obtained directly as search results even for words that are omitted in the utterances, including words represented by zero pronouns.
In Embodiments 1 to 3, the sentence unit search apparatus 1 specifies and stores the information indicating the manifestation for each sentence unit. However, a configuration may also be adopted in which paragraphs enclosed by the tags <p></p> are used instead: the feature pattern is specified for each paragraph, the information indicating the manifestation is stored in the salience attribute, and the paragraph is output as the search result. The unit is not limited to a sentence or a paragraph and may be a phrase, as long as it is a unit that represents a certain cohesion of meaning. In the case of spoken language, the character string identifiable as one sentence can be very long. Document data consisting of spoken language may therefore be stored in advance separately from document data consisting of written language, and a configuration may be adopted in which the document storage means 2 stores the probability each time a word's feature pattern is specified and its reference probability is calculated.
In order to determine whether or not successively received words form one series, the CPU 11 of the sentence unit search device 1 can use information identifying the accepting device 4 that is the transmission source of the words, or information indicating that the accepting device 4 has detected the user's search start/end operation. The received words can also be stored in the document storage unit 2 in units corresponding to the pages of the document data stored in the document storage unit 2 in advance.
In Embodiments 1 to 3, the sentence unit search device 1 performs all of the processing: acquiring and tagging the document data, the regression analysis for obtaining the reference probabilities, and the processing performed when words are received. However, this processing may be divided between a sentence unit search device and a document storage device. In that configuration, the document storage device acquires the document data by Web crawling, adds tags to the text data by morphological analysis and syntactic analysis, and stores it; furthermore, the equation for calculating the reference probability is obtained in advance by regression analysis based on the document data stored in the document storage device, and the words and their reference probabilities are stored for each sentence unit using the obtained equation. The sentence unit search device then specifies the feature patterns when it receives text data converted from words, acquires from the document storage device the regression equation for calculating the reference probabilities, calculates the reference probabilities, and performs the search.
In Embodiments 1 to 3, the input of words, such as a character-string input or a speech input from the user, is converted into text data by the accepting device 4 and transmitted to the sentence unit search device 1. However, the sentence unit search apparatus 1 may itself be configured to include input/output means that accepts the user's character-string input operations and voice input means that accepts the user's speech input. FIG. 29 is a block diagram showing the configuration in the case where the sentence unit search method of the present invention is implemented by the sentence unit search apparatus 1 alone. In this case, in addition to the CPU 11, the internal bus 12, the storage means 13, the temporary storage area 14, the document set connection means 16, and the auxiliary storage means 17, the sentence unit search device 1 further includes operation means 145, such as a mouse or a keyboard, that accepts user operations, display means 146 such as a monitor, and voice input/output means 147 such as a microphone and a speaker.
In this configuration, the CPU 11 of the sentence unit search device 1 can detect the frequency, speech rate, and other characteristics of the speech input from the voice input means and specify the feature pattern of each word in the utterance. It is also possible to convert the speech into text data by speech recognition, specify the grammatical feature pattern of each word, and perform the search based on the text data.
In Embodiments 1 to 3, the accepting devices 4, 4, ... were configured merely as devices that cut the received character-string or speech words into certain lengths, convert them into digital data, and transmit them. However, the accepting devices 4, 4, ... may be configured so that the CPU 41, by executing the program stored in the storage means 43, performs natural language analysis such as morphological analysis and syntactic analysis, or phoneme analysis, on the accepted words. Furthermore, the CPU 41 of the accepting devices 4, 4, ... may calculate the weight values representing the manifestation of each word in the accepted words and transmit the calculated weighted word group to the sentence unit search device 1 as the search request.
The sentence unit search method according to the present invention can be applied, in combination with speech recognition of conversations between users, to applications in which a computer apparatus participates in a conversation between users and carries on the conversation. It can also be applied to applications that provide a conversation-linked advertisement presentation service in which advertisements are switched according to the flow of the conversation or chat context between users; to conference support services that present similar and related minutes from past minutes according to the flow of the context during a meeting; and further to writing support services that accept written text as words and present related information according to the flow of the context.

Abstract

A computer executing the sentence unit search method divides the document data of a document set into sentence units in advance. Information representing the cohesion of meaning that reflects the flow of context from the preceding sentence units, namely a weighted word group in which a weight value is given to each word of a sentence unit, is associated with each sentence unit, and the sentence units and the associated weighted word groups are stored. When the computer receives words, it obtains information representing the cohesion of meaning in the flow of the uttered conversation, namely a weighted word group in which a weight value is given to each word, associates it with the received words, extracts the sentence units whose cohesion of meaning is similar according to the weighted word group associated with the words, and outputs them as the search result. The weight value given to each word may be a value reflecting the influence of the weight values of the related words in the sentence unit according to the degree of relation between each related word and the word.

Description

Specification

Sentence unit search method, sentence unit search device, computer program, recording medium, and document storage device

Technical field

[0001] The present invention relates to a search method that searches a stored collection of document data based on words, such as text or speech, accepted from a user for searching. In particular, the present invention relates to a sentence unit search method capable of directly retrieving, from among the sentence units that are the units of cohesion of meaning in documents whose meaning changes dynamically with the flow of context, the sentence units whose meaning is similar to the accepted words; to a sentence unit search apparatus; to a computer program that causes a computer to function as the sentence unit search apparatus; to a computer-readable recording medium on which the computer program is recorded; and to a document storage apparatus.
Background art

[0002] Among the various services provided on the Internet, there are document search services that, based on keywords or sentences input by a user, search the documents published on the Internet for related documents and output them as a list.

[0003] A conventional document search service operates as follows. Documents published on the Internet are automatically collected and stored, and for each document the words appearing in it are stored together with their appearance probabilities within the document. When words such as keywords or sentences are accepted, documents are extracted from the stored document set with priorities assigned in descending order of the appearance probabilities of the words contained in the accepted keywords or sentences, and the sentences or paragraphs containing those words are output from the extracted documents.

[0004] A user of a document search service must think up the keywords related to the information he or she wants to find. Some recent document search services can accept a natural sentence as an input, morphologically analyze the input sentence, identify its keywords, and automatically create a search request.
[0005] Moreover, even when a document search service accepts natural-sentence input, it usually extracts the words contained in the input sentence and outputs the documents containing the extracted words as search results. The user therefore has to narrow the results down by further entering keywords related to the initial keywords, or words that change the sense of those keywords, in order to obtain the desired results. For example, with "president" alone it is unclear which country's president is meant, so the keyword must be supplemented to "president, America". Furthermore, depending on what the user wants to find out about the American president, information that makes the desired results easier to obtain, such as "president, America, origin" or "president, America, policy", must be considered.

[0006] Therefore, in order to actually obtain the search results the user has in mind, the user must think up combinations of keywords and try them repeatedly. For example, even when the user wants to know "what measures the American president takes when economic problems arise with other countries", the query "America, president, economy" yields a huge number of results from which the user must select documents. Suppose the user then narrows the search by adding the keyword "policy" and inputs "America, president, economy, policy". Even though "policy" is a broad, high-level concept, the narrowing is performed on the keyword "policy" itself, so documents that do discuss economic policy but in which the word "policy" itself appears infrequently may be missed. In this way it is difficult for the user to obtain the desired results by devising and trying keywords for the purpose of the search; each time additional information is entered, the content of the search results may drift away from the original purpose of the search.

[0007] Also, in the above example, what the user wants to know concerns economic policy, and moreover international policy. Even if the user's input is a natural sentence, a human reader can grasp which of the words "America, president, other countries, economy, problems, arise, case, measures" is the most important, but it is difficult to express this quantitatively as information handled by a device or computer. Consequently, a document that discusses "America's economic problems and the measures taken by the presidents of other countries", which contains all the keywords, may well be output.
[0008] Furthermore, when the document to be searched is very long, the search is based on the words appearing in the document treated as a single unit, even though the context changes dynamically within it. Thus, if there exists a document in which the history of the American presidents, the histories of the presidents of other countries, the economic systems of various countries, and measures against unemployment in each country are described in separate chapters, it will be output as a search result because it contains most of the search keywords. Even when those chapters are not contextually connected, the results of partially extracting the sentences or paragraphs containing the keywords are output. It therefore cannot be gauged whether the meaning of the extracted part, including the influence of the preceding context leading up to it, semantically matches the search intention in the user's mind.

[0009] On the other hand, there are cases where a keyword entered for a search does not appear frequently in the target document and yet is contained in it with an important contextual meaning. For example, the more central a word is to the topic, the more often it is expressed by a demonstrative pronoun or a zero pronoun. The sentences or paragraphs in which the entered keyword is expressed by a demonstrative pronoun or a zero pronoun may therefore be exactly the information the user wants to obtain as a search result. However, when priorities are assigned to the search results according to the actual appearance frequency, such passages are excluded from the candidates during narrowing because the keyword entered by the user appears only rarely, and they are not output as search results.

[0010] A technique has therefore been proposed in which the words in a document are extracted, and the part-of-speech information of the words, dependency information between words, and information explicitly identifying the words that stand in anaphoric relation to demonstrative pronouns or zero pronouns are added to the result of analyzing the document by morphological analysis and the like and stored; document search, question answering, and machine translation by a device or computer are then realized based on the stored information (Non-Patent Document 1).

[0011] Relations between words such as dependency and anaphora occur in natural sentences whose phrase order is complicated, so even though a human reader can determine the meaning, it is difficult to recognize them mechanically. In the technique described in Non-Patent Document 1, relations such as dependency and anaphora between words are therefore added to the document data by tags as information for each sentence or phrase and stored. In Japanese in particular there are many sentences in which the subject is omitted, so the subject must be supplemented for mechanical translation. The technique of Non-Patent Document 1 therefore adds complementary information, such as the subject or zero pronouns, for each sentence. This makes accurate machine translation possible by using documents to which this information has been added. Words omitted in a sentence, or represented by demonstrative pronouns or zero pronouns, can also be used in applied techniques such as calculating appearance frequencies for document search.

Non-Patent Document 1: Koiti Hasida, "Global Document Annotation (大域文書修飾)", Proceedings of the 11th Annual Conference of the Japanese Society for Artificial Intelligence, pp. 62-63 (1997)
発明の開示  Disclosure of the invention
発明が解決しょうとする課題  Problems to be solved by the invention
[0012] 文章を書く時、又は発話する時の、その各文又は各発話夫々におけるユーザの注 目対象 (重点対象)は、会話や文章の文脈の流れに従って動的に変化する。つまり、 会話や文章における単語への注目度合いを表す重みは、動的に変化する。よって、 会話や文章に関連する情報を検索するサービスを実現するためには、文脈に応じた 単語の重みの動的変化を追跡する必要がある。  [0012] When writing a sentence or when speaking, the user's attention object (priority object) in each sentence or each utterance dynamically changes according to the context or the context flow of the sentence. In other words, the weight representing the degree of attention to words in conversations and sentences dynamically changes. Therefore, in order to realize a service that retrieves information related to conversations and sentences, it is necessary to track dynamic changes in word weight according to the context.
[0013] し力しながら、従来の文書検索サービスでは、検索のために入力された単語の出現 頻度の高い文書を抽出し、抽出した文書から、当該単語を含む文又は段落を抽出し て出力するため、当該単語のその文又は段落の文脈で動的に変わる重みについて は考慮されずに検索される。したがって、出現頻度に基づく検索では、確かに検索の ために入力された単語を含んではいるものの、文脈上当該単語がユーザが考えるよ うに使用されていない場合があり、ユーザの検索目的を達成することができるとは限 らない。各単語の文脈上の意味における各文での重み、即ち文脈上注目されている か否かについては特定できない。したがって、入力したキーワードをユーザの考える 意味合 、通りに使用した文又は段落を出力することはできな 、。  However, in the conventional document search service, a document with a high frequency of appearance of words input for search is extracted, and a sentence or paragraph including the word is extracted and output from the extracted document. Therefore, the weights that change dynamically in the context of the sentence or paragraph of the word are searched without being considered. Therefore, in the search based on the appearance frequency, although the word input for the search is surely included, the word may not be used as the user thinks in context, thereby achieving the user's search purpose. It is not always possible. It is not possible to specify the weight of each sentence in the contextual meaning of each word, that is, whether or not it is noticed in context. Therefore, it is not possible to output the sentence or paragraph used according to the meaning of the keyword entered by the user.
[0014] The technique of Non-Patent Document 1 can automatically analyze information identifiable from grammar, such as part-of-speech information, and can add to a document information on the complementation of demonstrative pronouns or zero pronouns, on anaphora, and on dependency. With this added information a referenced noun can be counted toward appearance frequency, so the relationships between words within a sentence or paragraph can be analyzed. However, the degree of attention each word receives in a sentence or paragraph, that is, its salience, cannot be measured quantitatively.
[0015] The technique of Non-Patent Document 1 can be applied to question answering, in which a computer responds to a natural-language question while allowing for words omitted from the question. It does not, however, readily make it possible to compute the contextual meaning of a dialogue among several users as a quantitative value, or to generate and present, as a third party, utterances that follow the context of the users' dialogue.
[0016] Furthermore, a conventional document search service cannot perform a search that allows for words expressing background knowledge deeply related to the context when such words appear only infrequently in a document. It therefore cannot directly output sentences or paragraphs that evoke a word the searching user has in mind but that does not appear among the words input for the search.
[0017] The present invention was made in view of these circumstances. In the invention, a weighted word group, in which each word carries a weight value expressing its salience within a given sentence unit consisting of one or more sentences, is stored in association with each sentence unit; a weighted word group carrying weight values for received search words is likewise associated with those words, and sentence units whose associated weighted word groups are similar are extracted and output. An object of the invention is to provide a sentence unit search method that automatically generates, from received words, information expressing their meaning as shaped by the context of the preceding words in the user's mind, and that can directly retrieve, from among the sentence units of documents whose meaning changes dynamically with the flow of context, sentence units whose contextual units of meaning are similar to those expressed by the generated information; a sentence unit search device; a computer program that causes a computer to function as the sentence unit search device; and a computer-readable recording medium on which the computer program is recorded.
[0018] Another object of the present invention is to provide a sentence unit search method and a document storage device in which the weight value expressing the salience of each word in the weighted word group associated with a sentence unit or with received words is calculated as the probability that the word will appear, or be referred to, in subsequent sentence units or words, so that the salience of words, which changes over time across the sentence units or words in the flow of context, can be expressed quantitatively and put to use.
[0019] A further object of the present invention is to provide a sentence unit search method and a document storage device that quantitatively calculate the degree of relevance between related words and reflect that relevance in the salience of each word in each sentence unit or in received words, so that sentence units evoking a word the user is conscious of while speaking or writing can be retrieved effectively even when that word does not appear in the uttered words or the written text.
Means for Solving the Problems
[0020] The sentence unit search method according to the first invention uses a document set in which a plurality of pieces of natural-language document data are stored, separates document data obtained from the document set into sentence units each consisting of one or more sentences, receives words, and searches the separated sentence units on the basis of the received words. The method comprises: a step of storing in advance, in association with each of the successive sentence units in the document data, a weighted word group consisting of a plurality of words each assigned a weight value for that sentence unit; a step of, when words are received, associating with them a weighted word group consisting of a plurality of words each assigned a weight value for those words; a similar sentence unit extraction step of extracting from the document set sentence units recorded in association with weighted word groups similar to the weighted word group associated with the received words; and a step of outputting the extracted sentence units.
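As a concrete illustration of the data layout this claim implies, the following is a minimal sketch in Python. The names (SentenceUnit, weights, search) and the dot-product similarity are illustrative assumptions, not part of the claimed method; the claim only requires that some similarity test over weighted word groups be applied.

    # Minimal sketch: sentence units indexed by weighted word groups.
    # All names and the similarity measure are illustrative assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class SentenceUnit:
        text: str
        weights: dict[str, float] = field(default_factory=dict)  # word -> weight value

    def similarity(a: dict[str, float], b: dict[str, float]) -> float:
        # One possible similarity over weighted word groups (dot product here).
        shared = set(a) & set(b)
        return sum(a[w] * b[w] for w in shared)

    def search(query_weights: dict[str, float],
               units: list[SentenceUnit], top_n: int = 5) -> list[SentenceUnit]:
        # Rank stored sentence units by similarity to the query's weighted word group.
        ranked = sorted(units, key=lambda u: similarity(query_weights, u.weights),
                        reverse=True)
        return ranked[:top_n]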
[0021] In the sentence unit search method according to the second invention, the similar sentence unit extraction step comprises: a step of judging whether the distribution of the weight values of the plurality of words in the weighted word group associated with the received words and the distribution of the weight values of the plurality of words in the weighted word group associated with a previously separated sentence unit satisfy a predetermined condition; and a step of extracting the sentence units whose associated weighted word groups are judged to satisfy the predetermined condition.
[0022] In the sentence unit search method according to the third invention, the similar sentence unit extraction step comprises: a step of extracting, from the previously separated sentence units, those associated with a word group containing the same words as the weighted word group associated with the received words; a step of calculating, for each word shared between the received words and an extracted sentence unit, the difference between the weight values assigned in the two associated word groups; and a step of assigning the extracted sentence units priorities in ascending order of the calculated difference, the extracted sentence units being output according to the priorities.
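A sketch of this ranking rule follows, working directly on word-to-weight mappings. Aggregating the per-word differences by summation is one plausible choice; the claim itself only specifies that smaller differences earn higher priority.

    def rank_by_weight_difference(query_weights: dict[str, float],
                                  unit_weights: list[dict[str, float]]) -> list[int]:
        # Indices of sentence units sharing words with the query,
        # ordered so that smaller weight-value differences come first.
        candidates = [i for i, u in enumerate(unit_weights)
                      if set(u) & set(query_weights)]

        def total_difference(i: int) -> float:
            u = unit_weights[i]
            shared = set(u) & set(query_weights)
            # Aggregating per-word differences by summation is an assumption.
            return sum(abs(u[w] - query_weights[w]) for w in shared)

        return sorted(candidates, key=total_difference)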
[0023] The sentence unit search method according to the fourth invention comprises a step of calculating each weighted word group as a multidimensional vector in which each word constitutes one dimension and the magnitude of the weight value assigned to each word is the component in the dimension corresponding to that word. The similar sentence unit extraction step comprises a step of calculating the distance between the multidimensional vector stored for each separated sentence unit and the multidimensional vector associated with the received words, and a step of assigning the sentence units priorities in ascending order of the calculated distance, the sentence units being output according to the assigned priorities.
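The vector form can be sketched with NumPy as below; the fixed vocabulary-to-dimension mapping and the use of Euclidean distance are assumptions consistent with, but not mandated by, the claim.

    import numpy as np

    def to_vector(weights: dict[str, float], vocabulary: list[str]) -> np.ndarray:
        # One dimension per vocabulary word; the weight value is the component.
        return np.array([weights.get(w, 0.0) for w in vocabulary])

    def rank_by_distance(query_vec: np.ndarray,
                         unit_vecs: list[np.ndarray]) -> list[int]:
        # Indices of stored sentence units, nearest (most similar) first.
        dists = [float(np.linalg.norm(query_vec - v)) for v in unit_vecs]
        return sorted(range(len(unit_vecs)), key=dists.__getitem__)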
[0024] In the sentence unit search method according to the fifth invention, when a weighted word group is associated with a sentence unit or with received words, a reference probability calculation step calculates, for each word, the reference probability that the word will appear, or be referred to, in sentence units or words subsequent to that sentence unit or those words, and the calculated reference probability is assigned as the weight value of each word.
[0025] In the sentence unit search method according to the sixth invention, the reference probability calculation step comprises: a step of identifying, for each word, a feature pattern that includes the pattern with which the word appears across a plurality of sentence units including preceding ones, or the pattern with which the word is referred to from preceding sentence units; and a step of calculating the proportion of words for which the same feature pattern is identified, within the document data obtained from the document set, that appear or are referred to in a subsequent sentence unit; the calculated proportion is taken as the reference probability.
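Written out, the proportion described here is a simple empirical estimate. In our own notation (not the patent's): let C(f) be the number of word occurrences in the document set whose identified feature pattern equals f, and C_next(f) the number of those occurrences whose word appears or is referred to again in the following sentence unit. Then

    \hat{P}_{\mathrm{ref}}(f) = \frac{C_{\mathrm{next}}(f)}{C(f)}

is the calculated proportion used as the reference probability for any word exhibiting feature pattern f.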
[0026] The sentence unit search method according to the seventh invention comprises: an identification step of identifying, for each word extracted from the document set, the feature pattern of the word; a judgment step of judging whether a word for which the same feature pattern is identified appears or is referred to in a subsequent sentence unit in the document data; and a regression step of performing regression analysis between the identified feature patterns and the judgment results for the words they identify, to calculate regression coefficients of the feature patterns with respect to the reference probability. When a weighted word group is stored in association with a sentence unit, or associated with received words, the reference probability calculation step identifies, for each sentence unit or for the words, the feature pattern of each word in that sentence unit or those words, and calculates the reference probability using the regression coefficients for the identified feature pattern.
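The claim does not fix the form of the regression; a logistic model, which maps a weighted sum of feature values to a probability, is one natural instantiation and is sketched below (the scikit-learn usage and the toy feature encoding are our assumptions).

    from sklearn.linear_model import LogisticRegression

    # X: one row of numeric feature-pattern values per observed word occurrence.
    # y: 1 if that word appeared / was referred to in the following sentence unit.
    X = [[1, 0, 2, 1], [3, 1, 0, 0], [1, 1, 1, 0]]   # toy training samples
    y = [1, 0, 1]

    model = LogisticRegression().fit(X, y)            # regression step
    coefficients = model.coef_                        # per-feature regression coefficients

    # Reference probability for a new occurrence with feature pattern x:
    x = [[2, 0, 1, 1]]
    reference_probability = model.predict_proba(x)[0, 1]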
[0027] In the sentence unit search method according to the eighth invention, the proportion is calculated, for sentence units, over document data obtained from a first document set consisting of written language, and, for received words, over document data obtained from a second document set consisting of spoken language.
[0028] In the sentence unit search method according to the ninth invention, the identification step, the judgment step, and the regression step are executed in advance for each of a first document set consisting of written language and a second document set consisting of spoken language; the reference probability calculation step calculates the reference probability for a feature pattern identified in a sentence unit using the regression coefficients obtained by the regression step executed on the first document set, and calculates the reference probability for a feature pattern identified in the received words using the regression coefficients obtained by the regression step executed on the second document set.
[0029] In the sentence unit search method according to the tenth invention, the feature pattern is specified by information including one or more of: the number of sentence units or sets of words from the preceding sentence unit or words from which the word is referred to, up to the sentence unit or words containing the word; the dependency information of the word in the most recent preceding sentence unit or words in which it appears or is referred to; the number of times the word has appeared or been referred to up to the sentence unit or words containing it; the noun classification of the word in the most recent preceding sentence unit or words in which it appears or is referred to; whether the word is the topic in the most recent preceding sentence unit or words in which it appears or is referred to; whether the word is the subject in the most recent preceding sentence unit or words in which it appears or is referred to; the grammatical person of the word in the sentence unit or words containing it; and the part-of-speech information of the word in the sentence unit or words containing it.
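Concretely, such a feature pattern can be encoded as a fixed-length record; the field names and integer encodings below are illustrative assumptions chosen to match the items listed in this claim.

    from dataclasses import dataclass

    @dataclass
    class FeaturePattern:
        distance: int          # sentence units since the word last appeared / was referred to
        dependency_role: int   # coded dependency relation in that most recent occurrence
        mention_count: int     # appearances/references up to the current sentence unit
        noun_class: int        # coded noun classification (e.g. proper vs. common)
        is_topic: bool         # topic of its most recent sentence unit?
        is_subject: bool       # grammatical subject there?
        person: int            # grammatical person (1, 2, 3)
        pos: int               # coded part of speech

        def as_row(self) -> list[float]:
            # Numeric row usable as regression input (cf. the sketch above).
            return [self.distance, self.dependency_role, self.mention_count,
                    self.noun_class, float(self.is_topic), float(self.is_subject),
                    self.person, self.pos]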
[0030] In the sentence unit search method according to the eleventh invention, the feature pattern is specified by information including one or more of: the time corresponding to the interval between the preceding sentence unit or words from which the word is referred to and the sentence unit or words containing the word; the speech rate corresponding to the word in the most recent preceding sentence unit or words in which it appears or is referred to; and the frequency of the voice corresponding to the word in that most recent preceding sentence unit or words.
[0031] The sentence unit search method according to the twelfth invention comprises: a first step of extracting, for one word among the words extracted from the document set, from the weighted word groups associated with the separated sentence units, those word groups that contain the one word and in which the weight value of the one word is at least a predetermined value; a second step of creating a related word group in which the value obtained by integrating, word by word, the weight values of the words in the word groups extracted in the first step is assigned as the degree of relevance of the one word to each word; a third step of storing the created related word group in association with the one word; a step of executing the first to third steps in advance for each of the extracted words; and a relevance addition step of reassigning the weight value of each word in the weighted word group associated with each sentence unit or with each set of received words, using the degrees of relevance of the words in the related word group stored in association with that word.
[0032] In the sentence unit search method according to the thirteenth invention, the second step comprises: a step of calculating, over the extracted word groups, the sum of the weight values of each word contained in them, each weighted by the weight value of the one word; a step of averaging the calculated sums; and a step of assigning the averaged sum of the weight values of each word as the degree of relevance of that word in the related word group to be created.
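In our notation (the patent gives no formula), let v_s(w) be the weight value of word w in sentence unit s, let w_0 be the chosen word, and let S(w_0) be the set of sentence units whose word groups give w_0 a weight of at least the threshold. One consistent reading of the weighted sum followed by averaging is

    r(w \mid w_0) = \frac{\sum_{s \in S(w_0)} v_s(w_0)\, v_s(w)}{\sum_{s \in S(w_0)} v_s(w_0)}

so that sentence units in which w_0 itself is highly salient contribute more to the relevance of w; dividing by |S(w_0)| instead would be an equally literal reading of "averaging".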
[0033] In the sentence unit search method according to the fourteenth invention, the relevance addition step comprises: a step of multiplying, for each word in the weighted word group associated with each sentence unit or each set of received words, the weight value of each word in that group by the degree of relevance of each word contained in the related word group stored in association with that word; and a step of reassigning the weight value of each word in the weighted word group on the basis of the multiplication results.
[0034] The sentence unit search method according to the fifteenth invention comprises a step of calculating the related word group of each word as a multidimensional relevance vector in which each word constitutes one dimension and the magnitude of the degree of relevance assigned to each word is the component in the dimension corresponding to that word; the relevance addition step transforms the multidimensional vector stored for each separated sentence unit by the matrix whose columns are the relevance vectors of the words.
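A minimal NumPy sketch of this transformation, under the assumption of a three-word vocabulary and made-up relevance values:

    import numpy as np

    # Columns are the relevance vectors of the vocabulary words w1, w2, w3.
    # Diagonal entries are each word's relevance to itself (here 1.0).
    R = np.array([[1.0, 0.6, 0.0],
                  [0.6, 1.0, 0.1],
                  [0.0, 0.1, 1.0]])

    v = np.array([0.0, 0.8, 0.2])   # weighted word group of one sentence unit
    v_transformed = R @ v           # w1 now inherits weight from the related w2
    print(v_transformed)            # -> [0.48 0.82 0.28]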
[0035] The sentence unit search method according to the sixteenth invention uses a document set in which a plurality of pieces of natural-language document data are stored, receives words, and searches the document set on the basis of the received words. The method comprises: a step of separating document data obtained from the document set into sentence units each consisting of one or more sentences; a step of extracting, for each separated sentence unit, the words appearing in the sentence unit or referred to from preceding sentence units in the document data; a step of identifying and storing, for each word extracted for a sentence unit, its features in each sentence unit; a step of identifying, for each separated sentence unit, a feature pattern including the pattern of combinations of those features when the word extracted for the sentence unit appears in the sentence unit and in preceding sentence units, or the pattern of reference when the word is referred to from preceding sentence units; a step of storing the identified feature pattern together with whether the word it identifies appeared or was referred to in a subsequent sentence unit; a step of executing regression learning that performs, over all sentence units of the documents obtained from the document set, regression analysis of the reference probability that a word identified by a given feature pattern appears or is referred to in the subsequent sentence unit, to obtain regression coefficients for the feature patterns; a step of calculating, for each separated sentence unit and for each word extracted from the preceding sentence units up to that sentence unit in the document data, the reference probability of the word using the regression coefficients for the feature pattern identified in that sentence unit; a step of storing in advance, in association with each sentence unit, a weighted word group to which the calculated reference probabilities are assigned; a step of, when words are received, storing them in the order received; a step of, when words are received, extracting the words appearing in them or referred to from previously received words; a step of identifying the features of each extracted word in the received words; a step of identifying a feature pattern including the pattern of combinations of features when the word appeared in previously received words, or the pattern of reference from previously received words; a step of calculating the reference probability of the word using the regression coefficients for the identified feature pattern; a step of associating with the received words a weighted word group to which the calculated reference probabilities are assigned; a step of calculating, for each word shared between the weighted word group associated with the received words and that of a previously separated sentence unit, the difference between the assigned reference probabilities; a step of assigning the previously separated sentence units priorities in ascending order of the difference in reference probability; and a step of outputting the sentence units according to the assigned priorities.
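Putting the pieces together, the flow of the sixteenth invention can be summarized in pseudocode-style Python; every helper called here is a placeholder standing for a step sketched earlier (sentence separation, feature extraction, regression, ranking), not an implementation the patent supplies.

    def build_index(document_set):
        # Offline: learn reference probabilities and index sentence units.
        units = [u for doc in document_set for u in split_into_sentence_units(doc)]
        model = train_reference_probability_model(units)   # regression learning
        for u in units:
            u.weights = {w: model.reference_probability(w, u)
                         for w in words_in_or_referenced_by(u)}
        return units, model

    def search(received_words, history, units, model):
        # Online: weight the received words in their dialogue context, then rank
        # stored units by per-word reference-probability differences.
        query_weights = {w: model.reference_probability(w, received_words, history)
                         for w in words_in_or_referenced_by(received_words, history)}
        return rank_by_weight_difference(query_weights, [u.weights for u in units])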
[0036] The sentence unit search device according to the seventeenth invention comprises means for obtaining document data from a document set in which a plurality of pieces of natural-language document data are stored and means for receiving words, and searches the document set on the basis of the received words. The device comprises: means for separating the obtained document data into sentence units each consisting of one or more sentences; means for storing, in association with each of the successive sentence units in the obtained document data, a weighted word group consisting of the plurality of words each assigned a weight value for that sentence unit; means for storing received words in the order received; means for associating, each time new words are received, a weighted word group consisting of the plurality of words each assigned a weight value for those words; means for extracting, from the previously separated sentence units, sentence units recorded in association with weighted word groups similar to the weighted word group associated with the received words; and means for outputting the extracted sentence units.
[0037] The computer program according to the eighteenth invention causes a computer capable of obtaining document data from a document set in which a plurality of pieces of natural-language document data are stored to function as means for receiving words and means for searching the document set on the basis of the received words. The program causes the computer to function as: means for separating the obtained document data into sentence units each consisting of one or more sentences; means for storing, in association with each of the successive sentence units in the obtained document data, a weighted word group consisting of the plurality of words each assigned a weight value for that sentence unit; means for storing received words in the order received; means for associating, each time new words are received, a weighted word group consisting of the plurality of words each assigned a weight value for those words; and means for extracting, from the previously separated sentence units, sentence units recorded in association with weighted word groups similar to the weighted word group associated with the received words.
[0038] The computer-readable recording medium according to the nineteenth invention is characterized in that the computer program of the eighteenth invention is recorded on it.
[0039] The document storage device according to the twentieth invention comprises means for storing a plurality of pieces of natural-language document data and means for separating the stored document data, from the beginning of the document data onward, into sentence units each consisting of one or more sentences; for each separated sentence unit, the words appearing in the sentence unit or referred to from preceding sentence units are extracted, and the extracted words are stored for each sentence unit. The device further comprises means for storing, in association with each of the successive sentence units in the document data, a weighted word group consisting of the plurality of words each assigned a weight value for that sentence unit.
[0040] The document storage device according to the twenty-first invention comprises: extraction means for extracting, for one word among the extracted words, from the weighted word groups associated with the sentence units, those word groups that contain the one word and in which the weight value of the one word is at least a predetermined value; creation means for creating a related word group in which the value obtained by integrating, word by word, the weight values of the words in the word groups extracted by the extraction means is assigned as the degree of relevance of the one word to each word; and storage means for storing the created related word group in association with the one word. The processing of the extraction means, the creation means, and the storage means is executed for each of the extracted words, and a related word group is stored in association with each word.
[0041] In the first, seventeenth, eighteenth, and nineteenth inventions, document data is obtained from a document set in which natural-language document data is recorded, and the obtained document data is further separated into sentence units of one or more sentences. For each sentence unit, each word appearing in the document set is assigned a weight value for that sentence unit, and the weighted word group of those words is stored in association with the sentence unit. When words are received, a weighted word group of words carrying weight values for those words is likewise associated with them. From the previously separated sentence units, those associated with weighted word groups similar to the weighted word group associated with the received words are extracted and output.
[0042] In the second invention, when sentence units associated with similar weighted word groups are extracted in the first invention, whether a weighted word group is similar is judged by whether the distribution of the weight values of the plurality of words in the weighted word group stored in advance for a sentence unit and the distribution of the weight values of the plurality of words in the weighted word group associated with the received words satisfy a predetermined condition, and the sentence units associated with weighted word groups judged similar are extracted.
[0043] In the third invention, when sentence units associated with similar weighted word groups are extracted in the first or second invention, sentence units whose weighted word groups contain the same words are extracted, and priorities are assigned in ascending order of the difference between the weight values assigned to those same words.
[0044] In the fourth invention, the weighted word group of the first invention is obtained as a multidimensional vector in which each word constitutes one dimension and the magnitude of the weight value assigned to each word is the component in the dimension corresponding to that word. Whether weighted word groups are similar is judged by whether the distance between them, that is, between the multidimensional vectors, is short. The extracted sentence units are output in ascending order of the distance between the multidimensional vectors, in other words, in order of the similarity of the weighted word groups.
[0045] In the fifth invention, the weight value assigned to each word in the first to fourth inventions is the calculated reference probability that the word will appear, or be referred to, in subsequent sentence units or words.
[0046] In the sixth invention, the reference probability calculated in the fifth invention is calculated as the proportion of words, identified as having the same feature pattern as the one identified for a given word (the pattern of its appearances from preceding sentence units up to the current one, or the pattern of references from preceding sentence units), that go on to appear or be referred to in the subsequent sentence unit within the document set.
[0047] In the seventh invention, regression analysis is performed between the feature pattern identified for each word extracted from the document set and the judgment of whether the word identified by that feature pattern appeared or was referred to in a subsequent sentence unit in the documents of the document set, and regression coefficients of the feature patterns with respect to the reference probability that a word appears or is referred to in the subsequent sentence unit are calculated. The reference probability calculated in the fifth invention is then obtained, for each word, by identifying its feature pattern and applying the regression coefficients to that feature pattern.
[0048] In the eighth and ninth inventions, the document set is used divided into a first document set consisting of written language and a second document set consisting of spoken language. The reference probabilities assigned to the words of the weighted word groups associated with sentence units are calculated on the basis of the first document set, and the reference probabilities assigned to the words of the weighted word group associated with received words are calculated on the basis of the second document set.
[0049] In the tenth invention, when the reference probability is calculated in the sixth to ninth inventions, the features used to identify each word's feature pattern are treated quantitatively, including: the number of sentence units or sets of words from a preceding appearance or reference up to the current sentence unit or words; the dependency information of the word where it appeared or was referred to; the number of times it has appeared or been referred to; the noun classification of the word; whether the word is the topic; whether the word is the subject; the grammatical person of the word; and its part-of-speech information.
[0050] In the eleventh invention, when the reference probability is calculated in the sixth to tenth inventions, the features used to identify each word's feature pattern are treated quantitatively, including: the time elapsed since the preceding sentence unit or words in which the word appeared or was referred to; the speech rate of the voice corresponding to the word where it appeared or was referred to; and the pitch (frequency) of that voice.
[0051] In the twelfth invention, in the first to eleventh inventions, for one word among the words extracted from the document set, the weighted word groups in which the weight value of that word is at least a predetermined value are extracted. A single weighted word group, obtained by integrating, word by word, the weight values of the plurality of weighted word groups extracted for the one word, is created as its related word group. The degree of relevance of each word in the created related word group expresses how deeply that word's weight value is related to the one word whenever the one word carries a weight value at or above the predetermined value. A related word group is generated and stored for each word extracted from the document set. The weight value of each word in the weighted word group associated with each sentence unit or with received words is then reassigned using the degrees of relevance of the words in the related word group associated with that word.
[0052] In the thirteenth invention, when the related word group for one word is created in the twelfth invention, a sum is calculated over the word groups extracted as weighted word groups in which the one word's weight value is at least the predetermined value, each weighted by the weight value of the one word in that weighted word group. The sums are averaged, and the averaged sum of the weight values of each word is assigned as the degree of relevance of that word in the related word group.
[0053] In the fourteenth invention, the degree of relevance of each word in the related word groups stored in the twelfth or thirteenth invention is multiplied by the weight value of each word in the weighted word group associated with each sentence unit or each set of received words, and the multiplication results are reassigned as the weight values of the words in the weighted word group. When attention is paid to one word in the weighted word group, the degrees of relevance of the words in the related word group associated with that one word are used: multiplying the weight values of the other words in the weighted word group by their degrees of relevance in the one word's related word group incorporates into the one word's weight value the influence of the weight values of highly relevant other words.
[0054] In the fifteenth invention, the related word groups of the twelfth to fourteenth inventions are obtained as multidimensional relevance vectors in which each word constitutes one dimension and the magnitude of the degree of relevance assigned to each word is the component in the corresponding dimension. The multidimensional vector associated with each sentence unit or set of words is transformed by the matrix formed from the columns of the relevance vectors of the words. That is, the multidimensional vector is expressed in an oblique coordinate system in which the distance between word dimensions is shorter the more strongly the words are related. A multidimensional vector expressing a weighted word group is thereby rotated toward the axes of words strongly related to the words it contains, and the distance between multidimensional vectors containing strongly related words becomes shorter.
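In our notation, write r_i for the relevance vector of the i-th vocabulary word and A = [r_1 ... r_n] for the matrix whose columns they form. The transformation and its effect on distances are then

    v' = A v, \qquad d(v'_1, v'_2) = \lVert A (v_1 - v_2) \rVert

so two weighted word groups that differ mainly in strongly related dimensions (near-parallel columns of A) end up closer together after the transformation, which is exactly the oblique-coordinate effect described above.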
[0055] In the sixteenth invention, for each sentence unit into which the document data obtained from the document set is further separated, the words in the sentence unit or referred to from preceding sentence units are extracted; the features of each word in each sentence unit are identified; and a feature pattern is identified that includes the pattern of combinations of features from preceding sentence units up to each sentence unit, or the pattern of references to the word from preceding sentence units. On the basis of regression learning of the reference probability over the identified feature patterns, the reference probability of each extracted word is calculated and stored in advance for each sentence unit as a weighted word group. For received words as well, a feature pattern based on the preceding words is identified, the reference probability of each word is calculated, and a weighted word group is associated with the words. The previously stored sentence units are assigned priorities in ascending order of the difference between the reference probabilities of the words they share with the weighted word group of the received words, and are output accordingly.
[0056] In the twentieth invention, for each sentence unit into which the document data obtained from the document set is further separated, a weighted word group carrying the words' weights in that sentence unit is stored in association with the sentence unit.
[0057] In the twenty-first invention, the related word groups created in the twelfth invention for the words extracted from the documents are stored.
Effects of the Invention
[0058] According to the present invention, a weighted word group, in which each of a plurality of words is assigned a weight value for a given sentence unit, is stored in association with each sentence unit of one or more sentences in the document data obtained from the document set. The weighted word group is the set of the weight values of the words in each sentence unit, and can be taken as an estimate of the unit of meaning of that sentence unit. Because each weight value reflects the context carried over from the preceding sentence units, the weighted word group of each sentence unit in the separated sequence can be grasped, unlike a unit of meaning for the document as a whole, as a unit of meaning that changes dynamically over time within the flow of context continuing from the preceding sentences in the document. By extracting the sentence units associated with weighted word groups similar to the weighted word group carrying the weight values of the words input for a search, sentence units whose word salience, that is, whose unit of meaning, is similar can be retrieved directly, rather than matching the document as a whole.
[0059] Whether weighted word groups are similar can be judged as follows: when the distribution of the weight values of the plurality of words in the weighted word group of the received words is compared with the distribution of the weight values in a stored weighted word group, and a predetermined condition under which the distributions can be judged mutually similar is satisfied, the stored weighted word group can be said to be similar to that of the received words. For example, if the predetermined condition is that the distributions of the words' weight values be proportionally alike, the weighted word groups can be called similar when the ratio of one word's weight value to another's in one group is preserved between the corresponding weight values in the other group. The condition may also be set, for instance, as whether the weight values of one or more selected words are all at or above a predetermined value. Similarity may also be judged by whether the differences between the weight values of the same words are small when the weighted word group associated with the received words is compared with that associated with a previously separated sentence unit.
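Two of the example conditions in this paragraph could be realized as below; the thresholds and the use of cosine similarity as a proportional-likeness test are our assumptions, not the patent's.

    import math

    def proportionally_alike(a: dict[str, float], b: dict[str, float],
                             threshold: float = 0.9) -> bool:
        # Cosine similarity is insensitive to overall scale, so a high value
        # means the weight-value distributions are proportionally alike.
        shared = set(a) & set(b)
        dot = sum(a[w] * b[w] for w in shared)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return na > 0 and nb > 0 and dot / (na * nb) >= threshold

    def focus_words_salient(group: dict[str, float], focus: set[str],
                            minimum: float = 0.5) -> bool:
        # Second condition type: every selected word carries at least a
        # minimum weight value in the group.
        return all(group.get(w, 0.0) >= minimum for w in focus)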
[0060] Further, by expressing a weighted word group as a multidimensional vector in which each word is one dimension and the weight value of each word in the sentence unit or words is the component for that dimension, the unit of meaning of each sentence unit or set of words can be handled as a quantitative vector. Handling units of meaning as quantitative multidimensional vectors means that, using a computer capable of vector operations, similar sentence units can be extracted directly by calculating the distance between the vector associated with the received words and the vector stored for each sentence unit. Moreover, expressing them as multidimensional vectors makes it possible to frame the condition satisfied by the received words, or by previously separated sentence units, as a question of which region of the multidimensional space the vector falls in, so that similar sentence units can again be extracted directly.
[0061] The document set here is not limited to a collection of so-called written-language document data, and the sentence units separated from it are therefore not necessarily written-language sentence units. Document data means data already stored, as distinguished from words received in real time, and may be document data in which a spoken dialogue has been written down in order.
[0062] The received words are not limited to words or sentences input for the purpose of a search; they may be, for example, the individual utterances, including speech, in a dialogue between users. Since sentence units are extracted on the basis of a weighted word group carrying weight values for each utterance, a unit of meaning can be estimated for every utterance while allowing for the fact that meaning changes dynamically, utterance by utterance, over the course of a dialogue. Sentence units similar to the unit of meaning estimated for each utterance can therefore be extracted and presented.
[0063] Further, according to the present invention, by assigning as the weight value of each word in the weighted word group the reference probability that the word will appear or be referred to in subsequent sentence units or words, the weight value can express the degree of attention the word receives, that is, its salience, as a quantitative value. A word that is contextually important and attended to in a given sentence unit can be expected to continue to appear or be referred to with high probability; the reference probability can therefore be said to indicate the degree of attention, the salience, of each word in that sentence unit.
[0064] A word that is represented in a sentence unit only by a demonstrative pronoun or a zero pronoun without actually appearing, or even a word represented by neither, is considered highly salient in that sentence unit or those words if it appears or is referred to in subsequent sentence units or words, even though it does not actually appear in the current one. Since the reference probability is calculated from the word's feature pattern over the preceding sentence units relative to each sentence unit, the level of salience can be expressed quantitatively and more accurately even for words that do not actually appear.
[0065] Furthermore, when words are received as speech, whether a word contained in them carries weight can be characterized quantitatively from the properties of the voice at the moment of utterance, namely the speaking rate and the tone, and the level of each word's salience can be expressed accordingly.
[0066] Further, according to the present invention, when the sentence units output as search results are written language, the reference probabilities are calculated on the basis of a document set of written language; when the received words are spoken language, the reference probabilities are learned and calculated on the basis of a document set of spoken language. Sentence units closer in meaning can thus be output while respecting the differing characteristics of written and spoken language.
[0067] Also, according to the present invention, the degree of relevance from each word is calculated quantitatively and stored word by word. The weight value of each word in a weighted word group is recalculated on the basis of the weight values of the other words and their degrees of relevance to the word in question. The weight value of one word can thereby reflect the influence of the weight values of other words highly relevant to it; that is, when a word highly relevant to the one word carries a high weight value, the resulting rise in the one word's weight value can be reproduced.
[0068] When the related word group of a word is expressed as a relevance vector, and a weighted word group as a multidimensional vector, transforming the multidimensional vector by the matrix formed from the columns of the words' relevance vectors shortens the distance between multidimensional vectors expressing weighted word groups that contain strongly related words.
[0069] In this way, the influence of the weight values of words highly relevant to one word, among the other words in the weighted word group, can be reflected in that word's weight value. By reflecting degrees of relevance in the salience of each word in each sentence unit or set of words, the invention achieves the excellent effect that sentence units evoking a word the user is conscious of can be retrieved effectively even when the word does not appear in the received words.
Brief Description of the Drawings
[0070] [Fig. 1] is an explanatory diagram showing an outline of the sentence unit search method according to the present invention.
[Fig. 2] is a block diagram showing the configuration of a search system using the sentence unit search device according to Embodiment 1.
[Fig. 3] is a flowchart showing the procedure by which the CPU of the sentence unit search device in Embodiment 1 performs tagging and word extraction based on the results of the morphological analysis and syntactic analysis of acquired document data, and stores the results.
[Fig. 4] is an explanatory diagram showing an example of the contents of document data stored in the document storage means in Embodiment 1.
[Fig. 5] is an explanatory diagram showing an example of document data that the CPU of the sentence unit search device in Embodiment 1 stores in the document storage means with the results of morphological analysis and syntactic analysis attached.
[Fig. 6] is an explanatory diagram showing an example of the list of words extracted from all document data acquired by the CPU of the sentence unit search device in Embodiment 1.
[Fig. 7] is a flowchart showing the procedure by which the CPU of the sentence unit search device in Embodiment 1 extracts samples from the tagged document data stored in the document storage means and performs regression analysis to estimate a regression equation for calculating reference probabilities.
[Fig. 8] is an explanatory diagram showing an example of feature patterns identified for sentences in the document data stored in the document storage means in Embodiment 1.
[Fig. 9] is a flowchart showing the procedure by which the CPU of the sentence unit search device in Embodiment 1 calculates and stores a reference probability of each word for every sentence of the tagged document data stored in the document storage means.
[Fig. 10] is a flowchart showing the procedure by which the CPU of the sentence unit search device in Embodiment 1 calculates and stores a reference probability of each word for every sentence of the tagged document data stored in the document storage means.
[Fig. 11] is an explanatory diagram showing an example in which the CPU of the sentence unit search device in Embodiment 1 has separated a document represented by document data into individual sentences.
[Fig. 12] is an explanatory diagram showing an example of document data that the CPU of the sentence unit search device in Embodiment 1 stores in the document storage means with the calculated reference probabilities attached.
[Fig. 13] is an explanatory diagram showing an example of the contents of the database when the weighted word groups calculated for each sentence unit by the CPU of the sentence unit search device in Embodiment 1 are indexed and stored.
[Fig. 14] is an explanatory diagram showing how the pairs of words stored for each sentence by the CPU of the sentence unit search device, together with the reference probabilities calculated for those words, change as the sentences continue.
[Fig. 15] is a flowchart showing the search processing procedure of the sentence unit search device and the reception device in Embodiment 1.
[Fig. 16] is a flowchart showing the search processing procedure of the sentence unit search device and the reception device in Embodiment 1.
[Fig. 17] is a flowchart showing the search processing procedure of the sentence unit search device and the reception device in Embodiment 1.
[Fig. 18] is an explanatory diagram showing an example of feature patterns that the CPU of the sentence unit search device in Embodiment 1 has identified for text data received from the reception device.
[Fig. 19] is a flowchart showing the search processing procedure of the sentence unit search device and the reception device in Embodiment 2.
[Fig. 20] is an explanatory diagram showing an outline of the influence of the salience of words closely related to a given word, in connection with the search method of the present invention in Embodiment 3.
[Fig. 21] is a flowchart showing the procedure by which the CPU of the sentence unit search device in Embodiment 3 creates related word groups.
[Fig. 22] is a flowchart showing the procedure by which the CPU of the sentence unit search device in Embodiment 3 creates related word groups.
[Fig. 23] is an explanatory diagram showing examples of weighted word groups at each stage of processing when related word groups are created by the CPU of the sentence unit search device in Embodiment 3.
[Fig. 24] is a flowchart showing the procedure by which the CPU of the sentence unit search device in Embodiment 3 recalculates the weight value of each word in the weighted word groups stored in association with each sentence unit.
[Fig. 25] is a flowchart showing the details of the procedure by which the CPU of the sentence unit search device in Embodiment 3 recalculates the weight value of each word in the weighted word groups stored in association with each sentence unit.
[Fig. 26] is an explanatory diagram showing an example of the contents of the weight values representing the salience of each word, as calculated by the CPU of the sentence unit search device in Embodiment 3.
[Fig. 27] is a flowchart showing the search processing procedure of the sentence unit search device and the reception device in Embodiment 3.
[Fig. 28] is a flowchart showing the search processing procedure of the sentence unit search device and the reception device in Embodiment 3.
[Fig. 29] is a block diagram showing a configuration for carrying out the sentence unit search method of the present invention with a sentence unit search device.
Explanation of Symbols
[0071] 1 Sentence unit search device
11 CPU
13 Storage means
15 Communication means
16 Document set connection means
17 Auxiliary storage means
18 Portable recording medium
1P Control program
2 Document storage means
4 Reception device
BEST MODE FOR CARRYING OUT THE INVENTION
[0072] The present invention will now be described in detail with reference to the drawings illustrating its embodiments.
[0073] Fig. 1 is an explanatory diagram showing an outline of the sentence unit search method according to the present invention. In Fig. 1, reference numeral 100 denotes a document set in which a plurality of pieces of document data are stored. One document 101 obtained from the document set 100 consists of sentence units S1, ..., Si, Si+1, ..., each of which comprises one or more sentences. The sentence units S1, ..., Si, Si+1, ... run in order from the beginning of the document 101, carrying meanings that shift over time along the flow of the context. Reference numeral 200 in Fig. 1 denotes a conversation between user A and user B. The conversation 200 between user A and user B is a set of utterances Uj-3, ..., Uj from user A and user B, arranged in time series from top to bottom. The utterances are made in the order Uj-3, Uj-2, Uj-1, Uj. The conversation may also be treated simply as a set of consecutive utterances, without distinguishing user A from user B.
[0074] The sentence unit search method according to the present invention expresses the degree of attention the user pays to each word at the moment of writing or uttering a sentence unit or an utterance as a quantitative weight value assigned to that word. By using the weighted word group, which reflects the degree of attention to each word as it shifts from one sentence unit or utterance to the next in time series, as an index of the contextual meaning of each sentence unit, the method aims to directly retrieve and output sentence units that have a similar contextual meaning.
[0075] The conversation 200 in the example shown in Fig. 1 is a conversation between user A and user B about a trip to Kyoto. In utterance Uj-3 of the conversation 200, 'Kyoto' and 'trip' appear, and the flow of the context is 'a trip to Kyoto'. Utterance Uj-2 contains neither 'Kyoto' nor 'trip', but it is an utterance about 'the time (of the trip to Kyoto)', so attention is directed to 'Kyoto', 'trip', and 'time'. In Uj-1, 'hot' appears. 'Kyoto' and 'trip' do not appear in Uj-1, but since it means '(Kyoto is) hot', 'Kyoto' still carries weight in the contextual meaning. Moreover, between user A and user B, 'Kyoto' and 'time' attract more attention than 'trip' at the time of utterance Uj-1, and user A and user B should both be able to recognize that the contextual meaning has shifted. Furthermore, 'famous' and 'festival' appear in utterance Uj. If only the moment of utterance Uj is considered, the words 'Kyoto', 'trip', 'time', and 'hot' do not appear. For user A at least, however, utterance Uj contextually concerns a 'festival' in 'Kyoto' in 'summer'. Accordingly, even at the time of utterance Uj, 'Kyoto' still carries weight in the contextual meaning. Note that user A, who produced utterance Uj, should at least call to mind 'Gion Festival' or the like as a word corresponding to the festival.
[0076] Meanwhile, the document 101 in the document set 100 contains a travel account of Kyoto. The sentence unit Sk in it carries the meaning that speaking of 'Kyoto' in 'July' means the 'Gion Festival'; that is, Sk means that speaking of the 'festival' of 'Kyoto' in 'July' in 'summer' means the 'Gion Festival'. The utterance Uj and the sentence unit Sk therefore both place weight on 'summer', 'Kyoto', and 'festival', and their contextual meanings are similar. In this way, the sentence unit search method according to the present invention estimates the coherent contextual meaning carried over from the preceding utterances that the user has in mind at the time of utterance Uj, and aims to directly retrieve and output the sentence unit Sk, which has a similar contextual meaning.
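To make 'similar contextual meaning' concrete, the following is a minimal sketch, in Python, of comparing two weighted word groups by cosine similarity. The particular words and weight values are hypothetical illustrations, and the patent does not prescribe cosine similarity as the only possible measure.

```python
import math

def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity of two weighted word groups given as word -> weight."""
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical weighted word group for utterance Uj ...
utterance_uj = {"summer": 0.6, "Kyoto": 0.8, "festival": 0.7, "famous": 0.3}
# ... and for sentence unit Sk in the travel account.
sentence_sk = {"July": 0.5, "Kyoto": 0.7, "festival": 0.6, "Gion Festival": 0.9}

# High overlap on "Kyoto" and "festival" yields a high similarity score.
print(cosine_similarity(utterance_uj, sentence_sk))
```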
[0077] When a computer system implementing the sentence unit search method according to the present invention is realized, it not only accepts a series of utterances and extracts from the document set the sentence units whose contextual meaning is similar to that of those utterances; during a conversation between user A and user B, the computer system can also present information relevant to each utterance and join the conversation as a third participant. The computer system can likewise support the conversation between user A and user B. In the example of Fig. 1, if, following utterance Uj by user A in the conversation 200, the computer system outputs speech such as 'Speaking of Kyoto in July, it is the Gion Festival.', a three-way conversation among user A, user B, and the computer system is realized. Further, when the conversation between user A and user B stalls, the computer system supports it by presenting information such as 'Speaking of Kyoto in July, the Gion Festival'.
[0078] Therefore, in order to realize such retrieval of sentence units with similar contextual meaning from a document set, the sentence unit search method according to the present invention is executed by a computer device. In this case, the computer device requires pre-processing, including a process of separating the document data of the document set into sentence units in advance and a process of storing, for each separated sentence unit, quantitative information representing its contextual meaning. Furthermore, when the computer device accepts an utterance, search processing is required, including a process of obtaining quantitative information representing the meaning of that utterance within the flow of the conversation, and a process of extracting, on the basis of the obtained information, sentence units with similar meaning and outputting them as search results.
[0079] Accordingly, Embodiments 1 to 3 described below first explain the hardware configuration required for a computer device to carry out the sentence unit search method according to the present invention. The processing performed by the computer device is then explained in stages, distinguishing the pre-processing from the search processing. Specifically, each embodiment is described in the following order:
'1. Hardware configuration and system overview';
as pre-processing,
'2. Document data acquisition and natural language analysis' and
'3. Quantifying the coherent meaning of each sentence of document data';
and then
'4. Search processing'.
[0080] In Embodiments 1 to 3 described below, as an example of carrying out the sentence unit search method according to the present invention, a search system is described that comprises hardware storing a document set of document data, computer devices that accept utterances, and a computer device that executes the search processing while connected to the hardware storing the document set and to the computer devices accepting utterances.
[0081] The examples below mainly show the processing and concrete cases for a document set consisting of natural Japanese sentences. The sentence unit search method of the present invention can, of course, be applied not only to Japanese but also to other languages. In that case, for the grammatical handling peculiar to each language in linguistic analysis (morphological analysis and syntactic analysis) and the like, the method best suited to that language is used.
[0082] (Embodiment 1)
1. Hardware configuration and system overview
Fig. 2 is a block diagram showing the configuration of a search system using the sentence unit search device 1 according to Embodiment 1. The search system comprises the sentence unit search device 1, which executes search processing on document data; document storage means 2, which stores document data in natural language; a packet-switched network 3 such as the Internet; and reception devices 4, 4, ..., which accept words such as keywords or speech input by users. The sentence unit search device 1 is a PC (Personal Computer) and is connected to the document storage means 2, which stores the natural-language document data. The reception devices 4, 4, ... are also PCs, and the sentence unit search device 1 is connected to, and can communicate with, the reception devices 4, 4, ... via the packet-switched network 3.
[0083] In the search system of Embodiment 1, the sentence unit search device 1 stores document data containing the sentence units to be searched in the document storage means 2 in advance. The sentence unit search device 1 separates the document data stored in the document storage means 2 into sentence units beforehand and stores, for each sentence unit, quantitative information representing its contextual meaning so that search processing is possible later. The reception devices 4, 4, ... convert accepted words into text data or speech data that a computer can process and transmit the data to the sentence unit search device 1 via the packet-switched network 3. The sentence unit search device 1 extracts, on the basis of the received word data, one or more sentence units, each consisting of one or more sentences, from the document data stored in the document storage means 2, and outputs the extracted sentence units to the reception devices 4, 4, ... via the packet-switched network 3, thereby realizing sentence unit search.
[0084] The sentence unit search device 1 comprises at least a CPU 11 that controls the various hardware components, an internal bus 12 that connects them, storage means 13 comprising nonvolatile memory, a temporary storage area 14 comprising volatile memory, communication means 15 for connecting to the packet-switched network 3, document set connection means 16 for connecting to the document storage means 2, and auxiliary storage means 17 that uses a portable recording medium 18 such as a DVD or CD-ROM.
[0085] The storage means 13 stores a control program 1P, obtained from the portable recording medium 18 such as a DVD or CD-ROM, that causes the PC to operate as the sentence unit search device 1 according to the present invention. The CPU 11 reads the control program 1P from the storage means 13 and executes it, and controls the various hardware components via the internal bus 12. The temporary storage area 14 stores information generated temporarily by the arithmetic processing of the CPU 11.
[0086] The CPU 11 detects, via the communication means 15, that word data transmitted from the reception devices 4, 4, ... has been received, executes processing based on the received word data, and performs the search processing. The CPU 11 can also obtain the document data stored in the document storage means 2 via the document set connection means 16, and can store document data in the document storage means 2 via the document set connection means 16.
[0087] The control program 1P stored in the storage means 13, obtained from the portable recording medium 18 such as a DVD or CD-ROM via the auxiliary storage means 17, further enables the CPU 11 to perform natural language analysis, such as morphological analysis and syntactic analysis, on document data expressed as character strings, on the basis of dictionary information stored in the storage means 13.
[0088] Each of the reception devices 4, 4, ... comprises at least a CPU 41 that controls the various hardware components, an internal bus 42 that connects them, storage means 43 comprising nonvolatile memory, a temporary storage area 44 comprising volatile memory, operation means 45 such as a mouse or keyboard, display means 46 such as a monitor, voice input/output means 47 such as a microphone and a speaker, and communication means 48 for connecting to the packet-switched network 3.
[0089] The storage means 43 stores a processing program and the like that cause the PC to operate as a reception device 4, 4, .... The CPU 41 reads the processing program from the storage means 43 and executes it, and controls the various hardware components via the internal bus 42. The temporary storage area 44 stores information generated temporarily by the arithmetic processing of the CPU 41.
[0090] The CPU 41 can detect a character string input operation from a user via the operation means 45 and store the input character string in the temporary storage area 44. The CPU 41 detects speech input by a user via the voice input/output means 47 and can convert the input speech into text data by reading and executing a speech recognition program stored in the storage means 43. The CPU 41 can also capture speech input by a user, via the voice input/output means 47, as speech data that a computer can process.
[0091] The CPU 41 transmits the text or speech word data, obtained by detecting the user's character string input operation or speech input, to the sentence unit search device 1 via the communication means 48.
[0092] The CPU 41 may convert the speech data into text data before transmission. In that case, the CPU 41 may also transmit features of the speech data obtained through speech recognition, for example the speed at which the phonemes corresponding to each word were uttered and the frequency of those phonemes. The CPU 41 may further store the time intervals between the pieces of speech data corresponding to the individual words, and may also transmit to the sentence unit search device 1 the time elapsed since the point at which the word was last contained in previously accepted utterances.
[0093] 2. Document data acquisition and natural language analysis
In the search system configured as described above, the sentence unit search device 1 first performs pre-processing: it prepares a document set so that the coherent meaning of each sentence unit contained in each piece of document data can be represented later. This section, '2. Document data acquisition and natural language analysis', describes the processing by which the sentence unit search device 1 stores document data in the document storage means 2, linguistically analyzes each piece of document data to separate it into sentence units each consisting of one or more sentences, further analyzes the grammatical features of each sentence unit, and stores the results in the document storage means 2 for each sentence unit. In Embodiment 1, the sentence unit search device 1 treats one sentence as one sentence unit.
[0094] The CPU 11 of the sentence unit search device 1 stores document data containing the sentence units to be searched in the document storage means 2 in advance. The CPU 11 acquires document data available via the communication means 15 and the packet-switched network 3 by Web crawling, and stores it in the document storage means 2 via the document set connection means 16. The CPU 11 then separates the document data thus acquired and stored in the document storage means 2 into sentence units, performs linguistic analysis (morphological analysis and syntactic analysis) on each, and stores the results in association with each sentence unit.
[0095] The procedure by which the CPU 11 of the sentence unit search device 1 acquires document data, performs the natural language analyses of morphological analysis and syntactic analysis on it, and stores the results for each sentence unit is described below. Fig. 3 is a flowchart showing the procedure by which the CPU 11 of the sentence unit search device 1 in Embodiment 1 performs tagging and word extraction based on the results of the morphological analysis and syntactic analysis of the acquired document data, and stores the results. The processing shown in the flowchart of Fig. 3 corresponds to a process of extracting, for each sentence unit, the words that appear in that sentence unit or that are referred to from preceding sentence units, and a process of identifying and storing the features of each word in each sentence unit.
[0096] When the CPU 11 starts Web crawling, it determines whether document data has been acquired (step S11). If the CPU 11 determines that no document data has been acquired (S11: NO), the CPU 11 returns the processing to step S11 and waits until document data is acquired. If the CPU 11 determines that document data has been acquired (S11: YES), the CPU 11 attempts to read the acquired document data one sentence at a time and determines whether the read has succeeded (step S12).
[0097] If the CPU 11 determines that the reading position has not reached the end of the document data and the sentence has been read successfully (S12: YES), it performs morphological analysis and syntactic analysis on the read sentence (step S13).
[0098] From the results of the morphological analysis and syntactic analysis, the CPU 11 extracts the words that appear in the analyzed sentence and the words referred to in that sentence from preceding sentences, and stores them in a list (step S14). Further, as described later, the CPU 11 generates tags from the analysis results (step S15), attaches the tags to the read sentence, and stores it in the document storage means 2 via the document set connection means 16 (step S16).
[0099] If, on the other hand, the CPU 11 determines that the reading position has reached the end of the document data and the sentence read has failed (S12: NO), it ends the processing for the acquired document data.
[0100] The above processing is performed each time document data is acquired, and the tagged document data is accumulated in the document storage means 2.
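As a summary of steps S11 to S16, the following is a minimal sketch of the loop of Fig. 3 in Python. The stub analyzer with its hard-coded dictionary, the simplified su tagging, and the sample document are assumptions for illustration only; they stand in for the ChaSen/CaboCha analysis and the GDA tagging described later.

```python
KNOWN_WORDS = ["祭", "神霊", "九州", "秋"]  # hypothetical dictionary entries

def analyze(sentence: str) -> list[str]:
    """Stand-in for step S13: a real system would run a morphological and
    syntactic analyzer here; this stub only looks up a few known words."""
    return [w for w in KNOWN_WORDS if w in sentence]

def preprocess(documents: list[str], storage: list, word_list: set) -> None:
    """Sketch of the Fig. 3 loop (steps S11-S16)."""
    for document in documents:                                  # S11: document acquired
        sentences = [s + "。" for s in document.split("。") if s]  # S12: read sentences
        for sentence in sentences:
            words = analyze(sentence)                           # S13: linguistic analysis
            word_list.update(words)                             # S14: record words in a list
            tagged = "<su>" + sentence + "</su>"                # S15: generate (simplified) tags
            storage.append(tagged)                              # S16: store tagged sentence

storage: list = []
vocabulary: set = set()
preprocess(["祭とは神霊を祀る行事である。九州では秋に行われる。"], storage, vocabulary)
print(storage)
print(sorted(vocabulary))
```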
[0101] Next, the details of the above processing by the CPU 11 of the sentence unit search device 1 are described with a concrete example.
[0102] Fig. 4 is an explanatory diagram showing an example of the contents of document data stored in the document storage means 2 in Embodiment 1. The document data stored in the document storage means 2 is based on text data, such as HTML (HyperText Markup Language), that the CPU 11 of the sentence unit search device 1 obtains via the communication means 15 from publicly accessible Web servers connected to the packet-switched network 3. The example shown in Fig. 4 is likewise a document of HTML data obtained from a Web page published on the Internet (excerpted from http://ja.wikipedia.org/wiki/祭). This example document is used below to explain document analysis, search, and so on.
[0103] In the sentence reading processing of step S12 shown in the flowchart of Fig. 3, the CPU 11 of the sentence unit search device 1 separates the character strings in the acquired document data into the linguistic unit of a 'sentence' (sentence unit). As a separation method, for example, the CPU 11 may split on the character string representing the full stop '。' for document data in Japanese, or on the character string representing the period '.' for document data in English.
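As an illustration, the following is a minimal sketch in Python of such punctuation-based splitting. It is a simplification of the idea in the paragraph above, not the embodiment's actual implementation; real text would need extra handling (for example, abbreviations containing periods).

```python
import re

def split_sentences(text: str, lang: str) -> list[str]:
    """Split document text into 'sentence' units at the sentence-final mark:
    the full stop '。' for Japanese or the period '.' for English."""
    mark = "。" if lang == "ja" else r"\."
    return [p.strip() for p in re.split(f"(?<={mark})", text) if p.strip()]

print(split_sentences("祭とは催事のことである。九州では秋に行われるものもある。", "ja"))
print(split_sentences("There is a button. Press it.", "en"))
```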
[0104] Next, the details of the morphological analysis and syntactic analysis of step S13, performed by the CPU 11 of the sentence unit search device 1 as shown in the flowchart of Fig. 3, are described.
[0105] The CPU 11 of the sentence unit search device 1 performs morphological analysis, based on dictionary information, on the linguistic unit of a 'sentence', identifying the morphemes that are the minimum structural units of the sentence and analyzing the morpheme structure. For example, for the document data shown in Fig. 4, the CPU 11 identifies morphemes by matching against character strings in the dictionary information of the storage means 13: nouns such as '祭 (festival)' and '神霊 (divine spirit)', proper nouns such as '九州 (Kyushu)', verbs such as '祀る (to enshrine)', particles such as 'と' and 'は', and symbols such as '、' and '。'. Various techniques of morphological analysis have been proposed to date, and the present invention does not limit which morphological analysis technique is used.
[0106] Furthermore, the CPU 11 of the sentence unit search device 1 performs syntactic analysis, extracting the grammatical relationships between morphemes on the basis of the part-of-speech information of each identified morpheme (noun, particle, adjective, verb, adverb, and so on) and grammatical information obtained by statistically modeling the cohesion between parts of speech under Japanese grammar for Japanese sentences, or English grammar for English sentences. For example, by fitting the grammar to a tree structure, the relationships between morphemes can be extracted from their part-of-speech information according to the tree. Suppose the analysis target is (adjective + noun + particle + noun). First it is determined whether the target is a noun. If it is determined not to be a noun, it is next determined whether the target matches (adjective + noun); that is, whether the leading morpheme of the target is an adjective phrase. If the leading morpheme is determined to be an adjective, that adjective is judged to be the largest modifier in the target, modifying the noun that follows; in other words, the relationship (adjective + (noun)) is extracted.
[0107] Next, it is determined whether the remaining target is a (noun). If it consists of multiple morphemes and is determined not to be a noun, it is determined whether the remaining target matches (adjective + noun); that is, whether its leading morpheme is an adjective. If the leading morpheme of the remaining target is determined not to be an adjective, the adjective slot of (adjective + noun) is expanded into (noun + particle), and it is determined whether the remaining target matches ((noun + particle) + noun). If it is determined that it does, the grammatical relationship between the morphemes of the overall target (adjective + noun + particle + noun) can be extracted as [adjective + {(noun + particle) + noun}]. The method of syntactic analysis is not limited to approaches built on this procedure; as with morphological analysis, various techniques have been proposed to date, and the present invention does not limit which syntactic analysis technique is used.
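The following is a minimal sketch, in Python, of this kind of top-down matching over part-of-speech sequences. The tiny grammar (noun phrases only) and the bracketing it prints are illustrative assumptions, far simpler than the parsers cited in the next paragraph.

```python
from typing import Optional

def parse_np(tags: list[str]) -> Optional[str]:
    """Try to analyze a part-of-speech sequence as a noun phrase, returning
    a bracketing such as '[adj + {(noun + prt) + noun}]', or None on failure."""
    if tags == ["noun"]:
        return "noun"
    if tags and tags[0] == "adj":                    # (adjective + rest)
        inner = parse_np(tags[1:])
        if inner is not None:
            return f"[adj + {{{inner}}}]"
    if len(tags) >= 3 and tags[0] == "noun" and tags[1] == "prt":
        inner = parse_np(tags[2:])                   # ((noun + particle) + rest)
        if inner is not None:
            return f"(noun + prt) + {inner}"
    return None

# (adjective + noun + particle + noun) -> [adj + {(noun + prt) + noun}]
print(parse_np(["adj", "noun", "prt", "noun"]))
```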
[0108] In Embodiment 1, as one example, the morphological analysis and syntactic analysis are performed on the basis of the techniques disclosed in ChaSen (http://chasen.org) and CaboCha (Taku Kudo and Yuji Matsumoto, 'Japanese Dependency Analysis using Cascaded Chunking', IPSJ Journal, Vol. 43, No. 6, pp. 1834-1842 (2002); see http://chasen.org/~taku/software/cabocha). Alternatively, the analysis may be based on the technique disclosed in KNP (Kurohashi-Nagao Parser) (Sadao Kurohashi and Makoto Nagao, 'A Syntactic Analysis Method of Long Japanese Sentences Based on the Detection of Conjunctive Structures', Journal of Natural Language Processing, Vol. 1, No. 1, pp. 35-57 (1994)).
[0109] The CPU 11 of the sentence unit search device 1 generates document data in which the analyzed morphemes and the grammatical relationships between them are represented by tags based on XML (eXtensible Markup Language), and stores it in the document storage means 2. The natural language analysis methods used by the present invention for morphological analysis and syntactic analysis (ChaSen, CaboCha) morphologically analyze an input character string and then syntactically analyze it, outputting, for each separated morpheme, its part-of-speech information, information indicating its dependency target, and so on. The control program 1P stored in the storage means 13 of the sentence unit search device 1 is configured so that the CPU 11 of the sentence unit search device 1 can execute these natural language analysis methods.
[0110] In the morphological analysis and syntactic analysis used by the present invention, for example, the character string of the sentence shown in Fig. 4, '九州地方北部では、秋に行われるものに対して(お)くんちと称する場合もある。' ('In the northern Kyushu region, those held in autumn are sometimes called (O)kunchi.'), is first assigned chunk numbers: (0: 九州地方北部では、 / 1: 秋に行われるものに対して(お)くんちと称する場合も / 2: ある。). Each chunk is then separated into morphemes, and part-of-speech information, base-form information, pronunciation information, and the like are attached to each morpheme. For the chunk with chunk number 0, the morphemes are identified and annotated as (0: 九州 (noun + proper noun + region + general; 九州; キユウシユウ) / 地方 (noun + general; 地方; チホウ) / 北部 (noun + general; 北部; ホクブ) / で (particle + case particle + general; で; デ) / は (particle + binding particle; は; ハ) / 、 (symbol + comma)). The morpheme '九州 (Kyushu)' is a noun and a proper noun; it is also a noun denoting a region and is sometimes used as a common noun. It can be determined that its base form is '九州' and that it is pronounced 'キユウシユウ'. The same applies to the other chunks. Dependency information is obtained in a form such as (0 2, 1 2, 2 -1), from which the dependency relationships between chunks can be determined: the chunk with chunk number 0 depends on the chunk with chunk number 2, and the chunk with chunk number 1 likewise depends on the chunk with chunk number 2. That the chunk with chunk number 2 has no dependency target is indicated by its target being -1.
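The following is a minimal sketch, in Python, of reading such (chunk, head) dependency pairs into a structure and recovering each chunk's dependency target. The pair format mirrors the (0 2, 1 2, 2 -1) notation above; the chunk texts are taken from the example sentence, and everything else is an illustrative assumption.

```python
chunks = ["九州地方北部では、", "秋に行われるものに対して(お)くんちと称する場合も", "ある。"]

# Dependency pairs in the (chunk, head) notation above; -1 means "no head".
dependencies = [(0, 2), (1, 2), (2, -1)]

heads = {chunk: head for chunk, head in dependencies}
for i, text in enumerate(chunks):
    head = heads[i]
    if head == -1:
        print(f"chunk {i} {text!r}: root (no dependency target)")
    else:
        print(f"chunk {i} {text!r}: depends on chunk {head} {chunks[head]!r}")
```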
[0111] Fig. 5 is an explanatory diagram showing an example of document data that the CPU 11 of the sentence unit search device 1 in Embodiment 1 stores in the document storage means 2 with the results of the morphological analysis and syntactic analysis attached. It corresponds to an example of the document data stored in the document storage means 2 as a result of executing the processing procedure shown in the flowchart of Fig. 3 on document data with the contents shown in Fig. 4.
[0112] As shown in Fig. 5, the CPU 11 of the sentence unit search device 1 has separated part of the document with the contents shown in Fig. 4 into morphemes such as proper nouns, nouns, particles, and verbs, and the grammatical relationships between the morphemes are represented by the nesting of tags. The example shown in Fig. 5 follows the tagging scheme specified by the rules proposed in GDA (Global Document Annotation; see http://i-content.org/gda). The present invention is not limited to following these rules; any method may be used, not necessarily XML tagging, as long as it allows a computer to identify, through information processing, the morpheme information and the dependency information between morphemes.
[0113] Tagging based on GDA is basically expressed as <tag-name attribute-name="attribute-value">. In the example shown in Fig. 5, the tag <su> represents a sentence (sentential unit). In the example of Fig. 5, the tags show that the sentence '九州地方北部では、秋に行われるものに対して(お)くんちと称する場合もある。' comprises the units of three chunks, '九州地方北部では', '秋に行われるものに対して(お)くんちと称する場合も', and 'ある', together with punctuation. The tag <ad> indicates particles other than sentence-final particles, adverbs, adnominal modifiers, and the like; it can also indicate that chunk 0, '九州地方北部では', plays an adverbial role as a whole. The tag <n> indicates a noun, and the tag <v> indicates a verb. Besides the tags shown in Fig. 5, there are tags such as <aj>, which indicates an adjective.
[0114] The attribute with attribute name syn indicates the dependency relationship between linguistic units, such as chunks or words, enclosed by the tag to which the attribute is attached. In a sentence where the attribute value f (forward) is attached, each linguistic unit of the sentence depends on the nearest following linguistic unit. Under this default, chunk 0, '九州地方北部では', would depend on chunk 1, '秋に行われるものに対して(お)くんちと称する場合も', and chunk 1 would depend on chunk 2, 'ある'.
[0115] By syntactic analysis, however, it has been determined that chunk 0, '九州地方北部では', depends on chunk 2, 'ある', and that chunk 1, '秋に行われるものに対して(お)くんちと称する場合も', also depends on chunk 2, 'ある'; the default above therefore does not apply. In this case the dependency relationships can be indicated by appending 'p' to each tag to mark a 'phrase', that is, a unit that is not the receiving side of a dependency. For example, the tag <adp> combines the tag <ad> with the 'p' indicating a phrase: a chunk enclosed in <adp> tags is an adverbial phrase and not a chunk that receives a dependency. In the example shown in Fig. 5, therefore, chunk 1, '秋に行われるものに対して(お)くんちと称する場合も', is an adverbial phrase and not a receiving chunk, so it is indicated that chunk 0, '九州地方北部では', depends on 'ある', skipping over chunk 1. In other cases as well, 'p' is appended to make explicit that a unit is a 'phrase'.
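Schematically, the chunk-level markup described in the two paragraphs above might look as follows. This is an illustrative reconstruction from the description, with the inner morpheme tags of chunk 1 and all mph attributes omitted; it is not the exact markup of Fig. 5.

```xml
<su>
  <adp><n>九州</n><n>地方</n><n>北部</n><ad>で</ad><ad>は</ad>、</adp>
  <adp><!-- inner morpheme tags omitted -->秋に行われるものに対して(お)くんちと称する場合も</adp>
  <v>ある</v>。
</su>
```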
[0116] Likewise, for the tag <n>, writing <np> indicates that the word is not the receiving side of a dependency. '九州地方北部' can be separated into the morphemes '九州', '地方', and '北部', each enclosed in <n>; since '九州' depends on '地方' and '地方' depends on '北部', no 'p' is needed. By contrast, in '催事(催し、イベント)、フェスティバルのこと', '催事(催し、イベント)' depends not on 'フェスティバル' but on 'の', so the dependency relationship can be indicated by making the tag enclosing 'フェスティバル' an <np>.
[0117] A proper noun denoting a place, such as '九州', and a proper noun denoting a person's name, such as '太郎 (Taro)', can be indicated by the tags <placename> and <pername>, respectively.
[0118] Morphemes that refer back to a preceding word or sentence, such as demonstrative pronouns and zero pronouns, can be represented using attributes expressing anaphoric relations. In GDA, the attribute name id can be used to express which word of a preceding word or sentence a demonstrative pronoun or zero pronoun refers to. For example, for the sentence '右側にボタンがあるので、それを押してください。' ('There is a button on the right side, so please press it.'), a human reader naturally resolves 'それ (it)' to 'ボタン (the button)'. When a computer processes the sentence, however, although matching against dictionary information can identify 'それ' as a demonstrative pronoun, it cannot determine what it refers to. In GDA, therefore, an id attribute is attached to the 'ボタン' that 'それ' indicates, and the attribute name eq, indicating an equality relation with the morpheme identified by that id attribute, expresses 'それ' = 'ボタン'. Specifically, writing '右側に<np id="Btn">ボタン</np>があるので、<np eq="Btn">それ</np>を押してください。' (other tags omitted) expresses the relation 'それ' = 'ボタン'.
[0119] For a zero pronoun, there is no pronoun itself to which an eq attribute could be attached. The object represented by the zero pronoun can therefore be indicated by attaching information that makes the object explicit to the verb '押し (press)', whose action takes 'それ' = 'ボタン' as its object. The attribute name obj, indicating the object of the action of the morpheme enclosed by the tag, can express that the object of the action '押し' is 'ボタン'. Specifically, for the sentence '右側にボタンがあるので、押してください。' ('There is a button on the right side, so please press (it).'), writing '右側に<np id="Btn">ボタン</np>があるので、<v obj="Btn">押し</v>てください。' makes the relation to the omitted object explicit.
[0120] Even when the referring word is distant from the word it refers to, the anaphoric relation can be indicated by the id, eq, and obj attributes described above. For example, by writing '右側に<np id="Btn">ボタン</np>があります。' ('There is a button on the right side.'), '<np eq="Btn">それ</np>にはXのマークがついています。' ('It is marked with an X.'), and '停止する際に<v obj="Btn">押し</v>てください。' ('Press it when stopping.'), it can be indicated that 'それ' in the second sentence refers to 'ボタン' and that the object of '押し' in the third sentence is 'ボタン'.
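As an illustration of how such markup can be consumed mechanically, the following is a minimal sketch in Python that resolves the eq and obj references of the three-sentence example above. The flat <su> structure is a simplification assumed for the sketch, not the full GDA markup.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<doc>"
    '<su>右側に<np id="Btn">ボタン</np>があります。</su>'
    '<su><np eq="Btn">それ</np>にはXのマークがついています。</su>'
    '<su>停止する際に<v obj="Btn">押し</v>てください。</su>'
    "</doc>"
)

# Collect the antecedents introduced with an id attribute.
antecedents = {el.get("id"): el.text for el in doc.iter() if el.get("id")}

# Resolve eq (identity) and obj (omitted object) references against them.
for el in doc.iter():
    for attr in ("eq", "obj"):
        ref = el.get(attr)
        if ref:
            print(f"{el.text!r} ({attr}) -> {antecedents[ref]!r}")
# prints: 'それ' (eq) -> 'ボタン'
#         '押し' (obj) -> 'ボタン'
```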
[0121] In the attribute information of the tags enclosing each morpheme, such as <n>, <ad>, and <v>, information indicating the result of the morpheme analysis is attached under the attribute name mph. The attribute value indicates the part-of-speech information, base-form information, pronunciation information, and the like of the morpheme obtained by the morphological analysis. Specifically, for the attribute name mph, the additional information, part-of-speech information, conjugation information, base-form information, and pronunciation information are given as the attribute value, in the form mph="additional-info;part-of-speech;conjugation;base-form;pronunciation". In the example shown in Fig. 5, the mph attribute makes explicit for '九州' that its part of speech is classified as noun + proper noun + region + general, that its base form is 九州, and that it is pronounced 'キユウシユウ'. In the present invention, since the morphological analysis and syntactic analysis are performed on the basis of the methods provided by ChaSen, the identification information chasen is attached as the additional information of each morpheme.
[0122] As described above, the CPU 11 of the sentence unit search device 1 tags the document data acquired by Web crawling with the results of the morphological analysis and syntactic analysis in accordance with the GDA rules, and stores the resulting XML data in the document storage means 2 via the document set connection means 16. By storing the document data as XML data, the CPU 11 of the sentence unit search device 1 can identify the tags of the document data by character string analysis and, by identifying the attribute information attached to the tags, can determine the information and grammatical relationships of each morpheme.
[0123] Further, when morphologically analyzing the document data acquired by Web crawling, the CPU 11 of the sentence unit search device 1 extracts the words appearing in all the acquired document data, assigns them identification numbers, and stores them as a list in the storage means 13. Fig. 6 is an explanatory diagram showing an example of the list of words extracted from all the document data acquired by the CPU 11 of the sentence unit search device 1 in Embodiment 1. In the example shown in the explanatory diagram of Fig. 6, 31,245 words are listed. Commonplace words such as 'こと (thing)' and 'もの (object)' are excluded from the stored words: like conjunctions and articles, they are too general, and although they appear frequently, the words themselves carry no meaning, so they would burden the search processing and are unsuitable as search targets.
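The following is a minimal sketch, in Python, of building such a word list with identification numbers while filtering out commonplace words. The stopword set and the input words are illustrative assumptions.

```python
# Hypothetical stopwords: commonplace words excluded from the list.
STOPWORDS = {"こと", "もの"}

word_ids: dict = {}

def register(words: list) -> None:
    """Assign a serial identification number to each new, non-stopword word."""
    for word in words:
        if word in STOPWORDS or word in word_ids:
            continue
        word_ids[word] = len(word_ids) + 1

register(["祭", "こと", "神霊", "九州", "祭", "もの", "フェスティバル"])
print(word_ids)   # {'祭': 1, '神霊': 2, '九州': 3, 'フェスティバル': 4}
```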
[0124] 3. Quantifying the coherent meaning of each sentence of document data
3-1. Defining the coherent meaning of each sentence
Next, the CPU 11 of the sentence unit search device 1 identifies, for each sentence in the document data stored in the document storage means 2, information that quantitatively represents the coherent meaning of that sentence. The information quantitatively representing the coherent meaning of a sentence is expressed by the group of words the user is attending to when using the sentence (uttering, writing, hearing, or reading it), together with a value quantitatively indicating the degree of attention to each word, that is, its salience (the word's weight value).
[0125] The salience of each word within a sentence could also be quantified by the appearance frequencies used by conventional search services. However, appearance frequency is computed over a document, or over the whole document set, as its base. Therefore, although calculating the appearance frequency of each word per document can quantitatively represent the coherent meaning of the document as a whole, it cannot represent a coherent meaning that reflects the context, which changes dynamically from sentence to sentence with the flow of the document.
[0126] Furthermore, the salience of a word within a sentence can be expressed by grammatically distinguishing, according to how the word is used, the transition between the word's prominence in the preceding sentence and its prominence in the current sentence. That is, when a word that was the topic (subject) in the preceding sentence is also the topic (subject) in the current sentence, that word receives the most attention in the current sentence and is highly salient. By contrast, a word that did not appear in the preceding sentence but is the topic (subject) in the current sentence, while attracting attention in the current sentence, is less salient than a word that continues to be used as the topic as described above. The formalization of this salience has been studied as centering theory (Grosz et al., 1995; Nariyama, 2002; Poesio et al., 2004).
[0127] In the formalization given by centering theory, the salience of each word is not expressed as a feature quantity that a computer or the like can calculate quantitatively; it can only be determined to which of the transition types defined by centering theory the transition of each word belongs. The present invention therefore calculates the salience of each word in each sentence quantitatively.
[0128] In Embodiment 1, a reference probability per sentence unit is calculated for each word, and the calculated reference probability is assigned as a weight value representing the salience of the word in that sentence unit.
[0129] This is because the more attention a word receives in a sentence, the higher the probability that it will continue to appear in, or be referred to from, subsequent sentences; the probability that the word appears in or is referred to from a subsequent sentence can therefore be taken as its reference probability and regarded as its salience. Moreover, the reference probability that a word appears or is referred to in a subsequent sentence is not characterized by the meaning of the word, which is difficult to handle quantitatively. Instead, a feature pattern, including the pattern in which the word appears or is referred to, that can be analyzed by the information processing of the sentence unit search apparatus 1 is specified, and the proportion of words appearing or referred to with the same feature pattern as the specified one that actually appear or are referred to in the subsequent sentence is calculated as the reference probability.
[0130] Hereinafter, the reference probability of each word is used as the word's weight value for a sentence unit, and the set of words in a sentence to which the respective weight values have been assigned is called a weighted word group. The semantic cohesion of each sentence unit can thus be represented by a weighted word group to which quantitative weight values, namely reference probabilities, are assigned.
[0131] 3-2. Regression model learning
The reference probability is obtained as the proportion, among all occurrences of the same feature pattern as the specified one, of cases in which the word actually appears or is referred to in the subsequent sentence. If feature patterns identical to the specified one occurred in large and roughly equal numbers for every feature pattern, the reference probability could be calculated without statistical problems. In practice, however, the number of occurrences of an identical feature pattern is limited, and an enormous amount of document data would be required to calculate reliable reference probabilities. Therefore, a regression equation for predicting, from the feature pattern of a word (the factor behind the event), whether the word will appear or be referred to in the subsequent sentence is obtained by regression model learning on the feature patterns and on the events of whether the word actually appeared or was referred to in the subsequent sentence.
[0132] The following description is divided into two stages: "3-2-1. Specifying feature patterns" for the feature patterns that serve as samples for regression model learning, and "3-2-2. Learning the regression equation" using those feature patterns.
[0133] 3-2-1. Specifying feature patterns
Each sentence in the document data stored in the document storage means 2 is enclosed by <su> tags, and the words appearing in the sentence, or the words in an anaphoric relation with a demonstrative pronoun or zero pronoun in the sentence, can be specified from the attribute information of the tags. The sentence unit search apparatus 1 of the present invention therefore specifies feature patterns for the document data stored in the document storage means 2 as follows.
[0134] A pair of one sentence s in the document data and a word w contained in a sentence preceding that sentence in the same document data is taken as a sample (s, w). The feature pattern f(s, w) for the sample is specified by feature quantities such as the following: the distance (number of sentences) between sentence s and the sentence, among those preceding s, in which word w most recently appeared or was referred to (dist); the particle to which word w attaches where it most recently appeared or was referred to in a sentence preceding s (gram); and the number of times word w appeared or was referred to in the sentences preceding s (chain). The feature quantities are not limited to these; for example, whether word w is a word indicating a recent topic, or whether word w is in the first person, may also be used.
[0135] Because the results of morphological analysis and syntactic analysis are described in the document data stored in the document storage means 2 by tags conforming to GDA, character string analysis of the document data makes it possible to separate and count the sentences delimited by <su> tags, to identify particles from the part-of-speech information indicated by the tags within each sentence, and to count the number of appearances of a word, including appearances referred to by demonstrative pronouns or zero pronouns. The CPU 11 of the sentence unit search apparatus 1 can therefore specify the feature quantities dist, gram, and chain for each sample by analyzing the GDA-conformant tags and their attribute values.
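As a rough illustration of how f(s, w) = (dist, gram, chain) could be computed once the GDA tags have been reduced to (word, particle) pairs per sentence, the following Python sketch follows the counting conventions of the worked examples below; it is an assumption of this description, not the apparatus's actual tag parser:

def feature_pattern(sentences, i, w):
    # sentences: list of sentences s_1..s_n (0-indexed), each a list of
    # (word, particle) pairs; i: index of the current sentence s_i.
    # Following the worked examples, dist is counted up to the sentence
    # s_{i+1} whose appearance of w is to be predicted.
    dist, gram, chain = None, None, 0
    for j in range(i, -1, -1):            # s_i, s_{i-1}, ..., s_1
        for word, particle in sentences[j]:
            if word == w:
                chain += 1
                if dist is None:          # most recent occurrence of w
                    dist = (i + 1) - j
                    gram = particle       # e.g. 「は」 or a noun connection
    return dist, gram, chain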
[0136] The processing procedure by which the CPU 11 of the sentence unit search apparatus 1 extracts samples from the tagged document data stored in the document storage means 2, obtains the feature quantities of the extracted samples to specify their feature patterns, and estimates by regression analysis the regression equation for calculating reference probabilities from the feature patterns of the extracted samples is described next. FIG. 7 is a flowchart showing this processing procedure in Embodiment 1. The processing shown in the flowchart of FIG. 7 corresponds to the processing of specifying a feature pattern for each separated sentence unit, and the processing of executing the regression learning for calculating reference probabilities based on the feature patterns and on the determination results of whether the specified words appeared or were referred to in the subsequent sentence units.
[0137] The CPU 11 of the sentence unit search apparatus 1 acquires tagged document data from the document storage means 2 via the document set connection means 16 (step S21). The CPU 11 identifies the <su> tags attached to the acquired document data by character string analysis and separates the data into sentences (step S22). Next, the CPU 11 identifies each tag within the <su> elements indicating sentences by character string analysis, and extracts samples by associating each sentence with the words appearing in it or referred to from it (step S23). For each extracted sample, the CPU 11 identifies the tags by character string analysis and specifies the feature pattern consisting of dist, gram, and chain (step S24).

[0138] The CPU 11 determines whether the separated sentence is the end of the acquired document data (step S25). If the CPU 11 determines that the separated sentence is not the end of the document data (S25: NO), the CPU 11 returns the processing to step S22 and continues the separation processing by identifying the <su> tag of the following sentence. Whether the separated sentence is the end of the acquired document data can be determined, for example, by judging whether another <su> tag follows the <su></su> pair enclosing the currently separated sentence; if no tag follows, the sentence can be determined to be the end.
[0139] On the other hand, if the CPU 11 determines that the end of the document data has been reached (S25: YES), the CPU 11 determines whether extraction of a predetermined number of samples has been completed (step S26). If the CPU 11 determines that sample extraction has not been completed (S26: NO), the CPU 11 returns the processing to step S21, acquires different tagged document data, and continues the sample extraction.
[0140] If the CPU 11 determines that sample extraction has been completed (S26: YES), the CPU 11 performs regression analysis on the extracted samples, estimates the regression coefficients of the regression equation for the feature quantities dist, gram, and chain (step S27), and ends the processing.
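For illustration, step S27 could be sketched with an off-the-shelf logistic regression, assuming each sample has already been reduced to numeric feature values (scikit-learn is named here only as an example; the patent does not prescribe a library):

from sklearn.linear_model import LogisticRegression

def learn_regression(samples):
    # samples: list of ((dist, gram_value, chain), appeared) pairs, where
    # appeared is 1 if the word actually appeared or was referred to in
    # the subsequent sentence, else 0, and gram_value is the numeric
    # substitution value for the particle (see paragraph [0175] below).
    X = [list(features) for features, appeared in samples]
    y = [appeared for features, appeared in samples]
    model = LogisticRegression().fit(X, y)
    # Note: scikit-learn models Pr = 1/(1+exp(-(b0 + b.x))), i.e. the
    # negation of the exponent in equation (1) below.
    return model.intercept_[0], model.coef_[0]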
[0141] Next, the details of the above processing by the CPU 11 of the sentence unit search apparatus 1 are described with a concrete example.
[0142] FIG. 8 is an explanatory diagram showing an example of a feature pattern specified for a sentence in the document data stored in the document storage means 2 in Embodiment 1. For the sentence s_i shown in FIG. 8, the feature pattern f(s_i, 太郎君) of the sample (s_i, 太郎君) formed by sentence s_i and the word 「太郎君」 ("Taro-kun") contained in a preceding sentence is specified as follows. The distance feature quantity (dist) between the current sentence s_i and the sentence s_{i-1}, the most recent of the preceding sentences in which the word 「太郎君」 appeared or was referred to, is dist = 2, because the number of sentences from s_{i-1} up to the sentence s_{i+1} immediately following s_i is 2. Since the particle to which the word 「太郎君」 (referred to by 「彼」, "he") attaches in s_{i-1}, where it was most recently referred to, is 「は」, gram = ハ. Further, since the word 「太郎君」 appeared or was referred to in the sentences s_{i-2} and s_{i-1} preceding sentence s_i, chain = 2. The feature pattern is therefore specified as f(s_i, 太郎君) = (dist = 2, gram = ハ, chain = 2). In the case of English, gram would be specified by a preposition.

[0143] As described above, samples (s, w) are extracted from the sentences in the document data, and the feature pattern f(s, w) is specified for every extracted sample.
[0144] 3-2-2. Learning the regression equation
Next, the regression analysis of step S27 shown in the flowchart of FIG. 7 is described in detail.
[0145] In Embodiment 1, the regression analysis is performed based on a logistic regression model. The regression analysis is not limited to this; other regression analysis techniques, such as kNN (k-Nearest Neighbors) smoothing combined with a Support Vector Regression (SVR) model, may also be used.
[0146] When the kNN smoothing + SVR model is used, the regression model can be learned using the following eight elements as the feature quantities of the feature patterns that can be handled. The eight elements are the aforementioned dist, gram, and chain, plus the following five. One is the type of noun used when word w is referred to within the preceding sentence units (exp, pronoun: 1 / non-pronoun: 0). Another is whether word w is the topic when it appears or is referred to in the preceding sentence unit (last_topic, yes: 1 / no: 0). Another is whether word w is the subject when it appears or is referred to in the preceding sentence unit (last_sbj, yes: 1 / no: 0). Another is whether word w is in the first person in the sample (s, w) (p1, yes: 1 / no: 0). Another is the part-of-speech information of word w in the most recent preceding sentence unit in which it appeared or was referred to (pos, noun: 1, verb: 2, etc.). Yet another is whether word w is referred to in the title or a heading of the document (in_header, yes: 1 / no: 0). Furthermore, when regression analysis is performed based on speech data, any one or more of the following can be used among the eight elements: the number of seconds since the utterance time of the most recent reference to the word (time_dist), the speech rate per syllable of the phrase containing the most recent reference to the word, as a ratio to the speaker's average (syllable_speed), and the frequency ratio between the lowest and highest utterance pitches of the phrase containing the most recent reference to the word (pitch_fluct). By performing regression analysis on the feature quantities of speech data as well, reference probabilities can also be calculated from those feature quantities when the CPU 11 of the sentence unit search apparatus 1 receives speech data as word data, as described later.

[0147] Thus, when the kNN smoothing + SVR model is used, reference probabilities can be calculated based on more detailed feature quantities, and more precise reference probabilities can be obtained.
[0148] In Embodiment 1, whether word w actually appeared or was referred to in the sentence s_{i+1} following sentence s_i is taken as the explained variable, the dist, gram, and chain of the feature pattern specified for the sample (s_i, w) are taken as the feature quantities, and regression analysis is performed on all samples (s, w) using the logistic regression model. This yields a regression equation for calculating the probability Pr(s_{i+1}, w) that word w appears or is referred to in s_{i+1} when the feature quantities dist, gram, and chain are given.
[0149] The probability obtained by the logistic regression model is generally given, for explanatory variables (feature quantities) x_1, x_2, ..., x_n, by the following equation (1).
[0150] [数1]

Pr = 1 / (1 + exp(b_0 + b_1 x_1 + b_2 x_2 + ... + b_n x_n))    ...(1)
[0151] The parameters (regression coefficients) b_0, b_1, ..., b_n of equation (1) are estimated from the training samples by the maximum likelihood method. The regression analysis of the reference probability of word w in sentence s calculated in the present invention means the following: the explained variable is set to 0 for samples in which w neither appears nor is referred to in the subsequent sentence s_{i+1} and to 1 for samples in which it does appear or is referred to, the explanatory variables are the feature quantities dist, gram, and chain, and the extracted samples are learned to estimate the parameters (regression coefficients) b_0, b_1, b_2, b_3 of the following equation (2).
[0152] [数2]

Pr = 1 / (1 + exp(b_0 + b_1 dist + b_2 gram + b_3 chain))    ...(2)
[0153] The parameters (regression coefficients) learned from the extracted samples are estimated as, for example, b_0 = -1.425, b_1 = -0.564, b_2 = 11.036, and b_3 = 3.115 (regression analysis over 10000 samples). In this case, equation (3), obtained by substituting these parameters, is the regression equation for obtaining the reference probability.
[0154] [数3]

Pr = 1 / (1 + exp(-1.425 - 0.564 dist + 11.036 gram + 3.115 chain))    ...(3)
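Written out as an executable sketch with the example coefficients of paragraph [0153] (gram is the numeric substitution value described in paragraph [0175] below), equation (3) is:

import math

def reference_probability(dist, gram, chain,
                          b0=-1.425, b1=-0.564, b2=11.036, b3=3.115):
    # Equation (3): Pr = 1 / (1 + exp(b0 + b1*dist + b2*gram + b3*chain))
    return 1.0 / (1.0 + math.exp(b0 + b1 * dist + b2 * gram + b3 * chain))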
[0155] The values of the estimated parameters (regression coefficients) b_0, b_1, b_2, b_3 differ depending on the document data stored in the document storage means 2. For example, the estimated parameters differ between the case where the document data stored in the document storage means 2 consists only of newspaper articles, which are written language, and the case where it consists only of utterances, which are spoken language, converted into document data. Even for document data consisting only of newspaper articles of the same kind, the estimated parameter values b_0, b_1, b_2, b_3 differ depending on the amount of document data and the content of the documents. In the present invention, therefore, for regression analysis of spoken language, document data is stored separately for written and spoken language, parameters are also estimated by regression analysis for the document data consisting of spoken language, and the regression equation for calculating reference probabilities is stored. If the words accepted by the accepting apparatuses 4, 4, ... are limited to written-language sentences entered by character input rather than utterances entered by speech, the document data may be stored in the document storage means 2 without distinguishing between spoken and written language.
[0156] Through the above regression analysis, the parameters of the regression equation (3) for the feature quantities dist, gram, and chain are obtained. Therefore, by specifying the feature pattern consisting of the feature quantities dist, gram, and chain of each word in a sentence unit, the CPU 11 of the sentence unit search apparatus 1 can calculate the reference probability of a word having that feature pattern.
[0157] 3-3. Quantifying the salience of each sentence unit
Since the regression equation has been obtained by regression analysis, the CPU 11 of the sentence unit search apparatus 1 can calculate the reference probability of each word extracted per sentence unit by specifying its feature quantities dist, gram, and chain. The CPU 11 of the sentence unit search apparatus 1 therefore acquires the tagged document data stored in the document storage means 2, separates it into sentences, and, for the words appearing in or referred to from each sentence, specifies feature patterns and calculates reference probabilities. This makes it possible to quantitatively represent, for each sentence, a semantic cohesion in which the contextual meaning of the preceding sentences is reflected.
[0158] The processing by which the CPU 11 of the sentence unit search apparatus 1, after the regression analysis, calculates the words and the reference probability of each word (the weighted word group) for each sentence of the document data stored in the document storage means 2 is described below.
[0159] The CPU 11 of the sentence unit search apparatus 1 acquires the document data stored in the document storage means 2 and, for each sentence contained in the document data, specifies the grammatical feature pattern of each word over that sentence and the preceding sentences, calculates the reference probability of each word for each sentence based on the specified feature patterns and the regression equation, and stores the results in advance.
[0160] The CPU 11 of the sentence unit search apparatus 1 stores the set of each word and its reference probability (the weighted word group) in association with each sentence unit. That is, the CPU 11 performs this storage processing for the full text of all documents acquired from the document set. In the later search processing, on the other hand, the CPU 11 extracts from the full text of all documents the sentences whose contextual meaning is similar to the accepted words. In that case, reading out the full text of all documents one sentence at a time and reading out the weighted word group representing the contextual meaning associated with each sentence would impose a large processing load.
[0161] Therefore, in order to make it possible in later processing to extract the weighted word groups, each representing the contextual meaning of the preceding sentences for a sentence, without reading out the full text of all documents one by one, the CPU 11 of the sentence unit search apparatus 1 performs processing to build the weighted word groups calculated for each sentence into a database and index them.
[0162] FIG. 9 and FIG. 10 are flowcharts showing the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 in Embodiment 1 calculates and stores the reference probabilities of words for each sentence of the tagged document data stored in the document storage means 2. The processing shown in the flowcharts of FIG. 9 and FIG. 10 corresponds to the processing of calculating, for each sentence unit, the reference probabilities using the feature pattern specified for each word and the regression coefficients corresponding to the feature patterns, and to the processing of storing the calculated reference probabilities in advance as pairs with the words.
[0163] The CPU 11 of the sentence unit search apparatus 1 acquires tagged document data from the document storage means 2 via the document set connection means 16 (step S301). The CPU 11 identifies the <su> tags attached to the acquired document data by character string analysis and separates the data into sentences (step S302). Next, the CPU 11 identifies each tag within the <su> elements indicating sentences by character string analysis, extracts the words appearing in or referred to from each sentence (step S303), and stores the extracted words in the temporary storage area 14 while the reference probabilities for that document data are being calculated (step S304).
[0164] For each word of the document data containing the sentence, stored in the temporary storage area 14, the CPU 11 identifies the tags attached to the word by character string analysis and specifies the feature pattern consisting of dist, gram, and chain (step S305). Next, the CPU 11 substitutes the feature quantities of the specified feature pattern into equation (3) and calculates the reference probability (step S306).
[0165] The CPU 11 determines whether the reference probability of each word for the sentence has been calculated for all the words stored in the temporary storage area 14 (step S307). If the CPU 11 determines that reference probabilities have not been calculated for all the words (S307: NO), the CPU 11 returns the processing to step S305 and continues specifying feature patterns and calculating reference probabilities for the remaining words. On the other hand, if the CPU 11 determines that reference probabilities have been calculated for all the words (S307: YES), the CPU 11 stores the set of the words stored in the temporary storage area 14 and the reference probability calculated for each word (the weighted word group) by adding a salience attribute (step S308). At this point, the CPU 11 narrows down the words by a predetermined reference probability value and does not store words whose reference probability is below the predetermined value.
[0166] Next, the CPU 11 indexes the set of words and per-word reference probabilities (the weighted word group) attached to the current sentence and stores it in the weighted word group database so that it can be extracted later (step S309). The CPU 11 may store the database in the storage means 13, or in the document storage means 2 via the document set connection means 16. As one part of the indexing processing, the CPU 11 executes processing such as the following.
[0167] For example, the CPU 11 focuses on the reference probability of one word within the weighted word group obtained in step S308 and determines whether that word's reference probability is at least a predetermined value. Next, the CPU 11 determines whether the reference probability of another word within the weighted word group is at least a predetermined value. The CPU 11 thus determines, for each calculated weighted word group, whether it belongs to the group in which the first word's reference probability is at least the predetermined value or to the group in which it is below the predetermined value; if it belongs to the former, the CPU 11 further determines whether it belongs to the group in which another word's reference probability is at least a predetermined value or to the group in which it is below that value. By repeating such processing, the CPU 11 determines to which group each calculated weighted word group belongs and stores it in association with the identification information of that group. A k-d tree search algorithm, for example, can be applied to this indexing processing.
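As one hedged illustration of this indexing (the splitting words and the threshold are assumptions of this sketch, not values fixed by Embodiment 1), the repeated threshold tests can be seen as routing each weighted word group down a binary, k-d-tree-like structure to a node ID:

def kd_node_id(weighted_words, split_words, threshold=0.1):
    # weighted_words: dict word_id -> reference probability (weight value)
    # split_words: the word_ids used as the splitting dimension per level
    node_id = 0
    for word_id in split_words:
        bit = 1 if weighted_words.get(word_id, 0.0) >= threshold else 0
        node_id = node_id * 2 + bit    # descend left/right at each level
    return node_id

Weighted word groups that arrive at the same node ID then have similar weight-value distributions over the splitting words, so a later search only needs to examine the groups stored at or near the node reached by the query's weighted word group.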
[0168] The CPU 11 determines whether the processing of associating a weighted word group with each sentence has been completed for all sentences in the document data acquired in step S301 (step S310). The CPU 11 determines this as follows: for example, it judges whether another <su> tag follows the <su></su> pair enclosing the current sentence, and if no tag follows, it can determine that the end has been reached. If the CPU 11 determines that the processing of associating a weighted word group with each sentence has not been completed for all sentences in the document data acquired in step S301 (S310: NO), the CPU 11 returns the processing to step S302 and continues the processing for the next sentence. On the other hand, if the CPU 11 determines that the processing has been completed for all sentences in the document data acquired in step S301 (S310: YES), the CPU 11 deletes the words extracted from the document data and stored in the temporary storage area 14 (step S311).
[0169] The CPU 11 determines whether the processing of storing the words and the word reference probabilities by the salience attribute has been completed for all document data (step S312). If the CPU 11 determines that this processing has not been completed for all document data (S312: NO), the CPU 11 returns the processing to step S301, acquires other document data, and continues the processing. If the CPU 11 determines that the processing of storing the words and the word reference probabilities by the salience attribute has been completed for all document data (S312: YES), the CPU 11 ends the processing of calculating and storing word reference probabilities in advance.
[0170] Next, the case where the CPU 11 of the sentence unit search apparatus 1 performs the processing shown in the flowcharts of FIG. 9 and FIG. 10 on the document data shown in FIG. 5 is described concretely.
[0171] FIG. 11 is an explanatory diagram showing an example in which the CPU 11 of the sentence unit search apparatus 1 in Embodiment 1 has separated the document represented by the document data into sentences.
[0172] Through the processing of steps S301 and S302, the CPU 11 of the sentence unit search apparatus 1 identifies the <su> tags in the document data stored in the document storage means 2 and separates it into sentences. In the example shown in FIG. 11, the text is separated into the sentences s_1 「祭とは、神霊などを祀る儀式。」 ("A festival is a ritual that enshrines divine spirits and the like."), s_2 「祭礼、祭祀とも呼ばれる。」 ("It is also called a religious festival or a rite."), and s_3 「九州地方北部では、秋に行われるものに対して(お)くんちと称する場合もある。」 ("In the northern Kyushu region, those held in autumn are sometimes called (O)kunchi."). The words extracted from sentences s_1, s_2, and s_3 through the processing of step S303 by the CPU 11 of the sentence unit search apparatus 1, matching the words stored in the word list, are 「祭」 (festival), 「神霊」 (divine spirit), 「儀式」 (ritual), 「祭礼」 (religious festival), 「祭祀」 (rite), 「九州」 (Kyushu), 「九州地方」 (Kyushu region), 「九州地方北部」 (northern Kyushu region), 「秋」 (autumn), 「くんち」 (Kunchi), and 「場合」 (case) (see FIG. 6).
[0173] Through the processing of step S305, in order to quantitatively obtain the salience (reference probability) of each word in sentence s_3, the CPU 11 of the sentence unit search apparatus 1 specifies the feature pattern consisting of the feature quantities dist, gram, and chain for each word. For example, the feature pattern of 「九州」 (identification number: 9714; see FIG. 6) in sentence s_3 is specified as follows.
[0174] As shown in the explanatory diagram of FIG. 11, the dist of 「九州」 in sentence s_3 is dist = 1, from the distance of 1 between sentence s_3, in which 「九州」 most recently appeared, and the following sentence s_4. As for the gram of 「九州」 in sentence s_3: in sentence s_3, where 「九州」 most recently appeared, 「九州」 attaches not to a particle but to 「地方」 ("region"), so it can be specified as a noun connection, and gram = noun connection. As for the chain of 「九州」 in sentence s_3: 「九州」 appeared once from s_1 to s_3, so chain = 1. The feature pattern is therefore specified as f(s_3, 九州) = (dist = 1, gram = noun connection, chain = 1). The CPU 11 of the sentence unit search apparatus 1 then calculates the reference probability by substituting the values of the feature quantities dist, gram, and chain into equation (3) through the processing of step S306 in the flowcharts of FIG. 9 and FIG. 10.
[0175] Here, the substitution values for the feature quantity represented by gram are obtained by extracting samples (s, w) from the document data stored in the document storage means 2, calculating the reference probability of word w for each, and taking, for each gram, the average of those reference probabilities as the substitution value. For example, among the extracted samples (s, w), the average of the reference probabilities calculated for the words attached to gram = ハ is the value substituted when the feature quantity gram is 「ハ」. In Embodiment 1, as an example, the following values are calculated: gram = 0.0540 when gram = ハ, gram = 0.0288 when gram = ガ, gram = 0.0198 when gram = ノ, gram = 0.0179 when gram = ヲ, gram = 0.0124 when gram = ニ, and gram = 0.00352 when gram = noun connection.
[0176] It should be noted that the average reference probability that a word appears in the subsequent sentence, for the cases where the word attaches to the particle 「ハ」, the particle 「ガ」, the particle 「ノ」, or the particle 「ヲ」, decreases in the order 「ハ」 (topic) > 「ガ」 (subject) > 「ノ」 > 「ヲ」 (object). This is roughly consistent with the ranking topic > subject > object > ... that centering theory formalizes as an indicator of whether a word is the center of the sentence.
[0177] The reference probability of 「九州」 in sentence s_3 (the probability that 「九州」 appears or is referred to in sentence s_4) is calculated from the specified feature quantities by the following equation (4).
[0178] [数4]

Pr = 1 / (1 + exp(-1.425 - 0.564 x 1 + 11.036 x 0.00352 + 3.115 x 1)) = 0.238
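The arithmetic of equation (4) can be checked directly (a trivial sketch):

import math

pr = 1.0 / (1.0 + math.exp(-1.425 - 0.564 * 1
                           + 11.036 * 0.00352 + 3.115 * 1))
print(round(pr, 3))   # 0.238: the reference probability of 「九州」 in s_3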
[0179] As shown in equation (4), the reference probability of 「九州」 in sentence s_3 is calculated as 0.238. The calculated reference probability is stored for sentence s_3. The CPU 11 of the sentence unit search apparatus 1 represents each word for sentence s_3 by the identification number under which it is stored in the list, and stores it in association with its reference probability. In the present invention, an attribute named salience is defined on the <su> tag that delimits sentence units, and its attribute value is defined as an enumeration of pairs of a word identification number and a reference probability; the words and their reference probabilities (the weighted word group) are stored for each sentence as follows.
[0180] <su salience="identification number of word 1: reference probability of word 1 identification number of word 2: reference probability of word 2 identification number of word 3: reference probability of word 3 ...">~</su>
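For example, with the values described for FIG. 12 below, sentence s_3 would be stored along the following lines (a hypothetical instance assembled for illustration; the remaining entries are elided):

<su salience="9714:0.238 9716:0.1159 ...">九州地方北部では、秋に行われるものに対して(お)くんちと称する場合もある。</su>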
[0181] FIG. 12 is an explanatory diagram showing an example of document data to which the CPU 11 of the sentence unit search apparatus 1 in Embodiment 1 has assigned the calculated reference probabilities and which is stored in the document storage means 2. For sentence s_3, the reference probability of 「九州」 (9714) (its weight value in sentence s_3; the same applies below) is stored as 0.238, the reference probability of 「九州地方北部」 (9716) as 0.1159, and so on; for the following sentence s_4, the reference probability of 「九州」 (9714) is stored as 0.238, the reference probability of 「祭」 (22953) as 0.1836, and so on. A different set of words and reference probabilities (weighted word group) is stored for each sentence and can be used in searching as information representing the semantic cohesion of each sentence. In sentences s_3 and s_4, the same reference probability is calculated for 「九州」 (9714); however, if the description continues through sentences s_5, s_6, ... with descriptions of festivals not limited to the Kyushu region, the reference probability of 「九州」 is expected to decline gradually.
[0182] FIG. 13 is an explanatory diagram showing an example of the contents of the database when the CPU 11 of the sentence unit search apparatus 1 in Embodiment 1 indexes and stores the weighted word groups calculated for each sentence unit. The content example of FIG. 13 corresponds to the weighted word group associated with the sentence s_4 shown in the content example of FIG. 12, as indexed by step S309 of the CPU 11 shown in the flowcharts of FIG. 9 and FIG. 10.
[0183] As shown in FIG. 13, the CPU 11 stores each weighted word group in association with information indicating the group to which it belongs (a k-d tree node ID). In doing so, the CPU 11 also stores the file name of the tagged document data and the position within the document data (tag information) so that it can be specified with which sentence unit of which document data the weighted word group is associated. This makes it easy to extract, in later processing, the sentence units associated with weighted word groups similar to the weighted word group obtained for the accepted words.
[0184] FIG. 14 is an explanatory diagram showing how the sets of words stored for each sentence by the CPU 11 of the sentence unit search apparatus 1 and the reference probabilities calculated for those words change as the sentences continue. In FIG. 14, it can be seen that, as the text proceeds through sentences s_1, s_2, s_3, and s_4, the highly salient words differ from sentence to sentence in response to the context changing dynamically over time.
[0185] 4. Search processing
4-1. Accepting words input by the user
Next, the search processing in Embodiment 1 is described. The search processing starts when the accepting apparatuses 4, 4, ... accept words, such as keywords or speech, input by the user.
[0186] The CPU 41 of the accepting apparatus 4 is capable of detecting a character string input by the user via the operation means 45 and storing it in the temporary storage area 44, or of detecting speech input by the user via the speech input/output means 47, converting it into a character string, and storing it in the temporary storage area 44. The CPU 41 of the accepting apparatus 4 also has a function of analyzing the character string input by the user and separating it into individual sentences. For example, it may identify and separate on predetermined characters such as the full stop 「。」 in Japanese or the period "." in English. Alternatively, each time a press of the Enter key is detected via the operation means 45, the character string entered up to that point may be separated as one sentence. For speech input from the user, for example, the speech may be converted into a character string by a speech recognition function and separated into sentences by character string analysis of the converted string, or it may be separated into sentences where silence is detected. The CPU 41 of the accepting apparatus 4 transmits each separated sentence as text data to the sentence unit search apparatus 1 via the communication means 48.
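For illustration, the sentence separation on the delimiters mentioned above could be sketched as follows (an assumption of this description; the actual accepting apparatus 4 may separate differently):

import re

def split_sentences(text):
    # Split after the Japanese full stop 「。」 or an English period "."
    parts = re.split(r'(?<=[。.])\s*', text)
    return [p for p in parts if p]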
[0187] 4-2. Quantifying the semantic cohesion of accepted words
Next, the processing by which the CPU 11 of the sentence unit search apparatus 1, upon receiving text data indicating the words accepted by the accepting apparatuses 4, 4, ..., searches the sentences in the documents stored in the document storage means 2 is described. The semantic cohesion of the text data indicating the accepted words is also quantified; that is, words are extracted from the text data and their reference probabilities are calculated. In this way, information representing a semantic cohesion that reflects the context flowing from the preceding words in the user's latent awareness when inputting words can be created automatically as the search request of the search processing described later.
[0188] When the CPU 11 of the sentence unit search apparatus 1 receives, from the accepting apparatuses 4, 4, ... via the packet switching network 3 and the communication means 15, text data indicating the words accepted from the user, it stores the text data in the temporary storage area 14 in the order of reception and performs morphological analysis and syntactic analysis on the sentences indicated by the received text data. In addition, for each pair (s, w) of a sentence s indicated by the received text data and a word w that appeared in sentences indicated by text data received before sentence s, the CPU 11 specifies the feature pattern f(s, w) represented by the feature quantities dist, gram, and chain.

[0189] When the CPU 11 of the sentence unit search apparatus 1 has specified the feature pattern f(s, w) of word w in sentence s of the received text data, it calculates the reference probability based on the specified feature pattern and the regression equation obtained earlier. The CPU 11 of the sentence unit search apparatus 1 calculates the reference probability for each word and, using each word and the reference probability calculated for it, performs processing to compare them with the weighted word groups already stored in association with the sentence units, that is, the sets of words and per-word reference probabilities, thereby searching for sentence units.
[0190] Note that the CPU 11 of the sentence unit search apparatus 1 can receive from the accepting apparatuses 4, 4, ... not only text data but also speech data of utterances input by the user. In this case, the same processing is performed by specifying the grammatical feature patterns of the words represented in the speech data, as with text data. In the case of speech data, features obtainable from the speech data can also be treated as feature quantities for determining whether a word is highly salient. For example, when a word appears or is referred to, the CPU 11 can treat the time elapsed since it appeared or was referred to in the preceding words as one feature quantity. The CPU 11 can also treat the speech rate and/or the speech frequency when the word was uttered in the most recent preceding words in which it appeared or was referred to as further feature quantities. These are time information and information quantitatively representing the emotion put into the words, neither of which can be detected after conversion into text data.
[0191] The processing procedure by which the accepting apparatus 4 accepts words input by the user and transmits them to the sentence unit search apparatus 1, and by which the CPU 11 of the sentence unit search apparatus 1 searches the document data stored in the document storage means 2 based on the text data received from the accepting apparatus 4, is described with flowcharts. FIG. 15, FIG. 16, and FIG. 17 are flowcharts showing the processing procedure of the search processing of the sentence unit search apparatus 1 and the accepting apparatus 4 in Embodiment 1.
[0192] The CPU 41 of the accepting apparatus 4 determines whether it has detected a character string input operation by the user via the operation means 45 or speech input by the user via the speech input/output means 47 (step S401). If the CPU 41 determines that it has detected neither a character string input operation nor speech input by the user (S401: NO), the CPU 41 returns the processing to step S401 and waits until it detects a character string input operation or speech input by the user.

[0193] On the other hand, if the CPU 41 of the accepting apparatus 4 determines that it has detected a character string input operation or speech input by the user (S401: YES), the CPU 41 of the accepting apparatus 4 separates the input words into a sentence from the input character string or from the character string converted from the speech input, stores it in the temporary storage area 44 (step S402), and transmits the words input by the user to the sentence unit search apparatus 1 via the packet switching network 3 (step S403).
[0194] The CPU 11 of the sentence unit search apparatus 1 receives the words input by the user from the accepting apparatus 4 (step S404), and the CPU 11 stores the received words as sentences in the temporary storage area 14 as text data, in the order of reception (step S405). At this point, a sentence identification number may be added to each item of text data when storing it.
[0195] The CPU 11 performs morphological analysis and syntactic analysis on the stored text data (step S406), and stores the words extracted by the analysis in the temporary storage area 14 (step S407). In doing so, the CPU 11 matches each word against the words stored in the list and stores the word under its identification number in the list.
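By way of illustration only (the specification discloses no source code), the following Python sketch shows one way the word extraction of steps S406 and S407 could look. The whitespace tokenizer and the sample vocabulary entries are assumptions standing in for the morphological/syntactic analyzer and the word list of FIG. 6.

```python
# Illustrative sketch of steps S406-S407: analyze received text and store
# each extracted word under its identification number in the word list.
# A whitespace tokenizer stands in for the morphological analyzer.

def extract_word_ids(text: str, vocabulary: dict[str, int]) -> list[int]:
    """Return the list IDs of known words appearing in the text."""
    ids = []
    for token in text.split():          # stand-in for morphological analysis
        if token in vocabulary:         # match against the stored word list
            ids.append(vocabulary[token])
    return ids

# Hypothetical list entries; the real list of FIG. 6 holds 31245 words.
vocabulary = {"Okunchi": 20105, "Kyushu": 9714}
print(extract_word_ids("Okunchi is held in Kyushu", vocabulary))  # [20105, 9714]
```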
[0196] Through the processing in step S407 of the sentence unit search device 1, the temporary storage area 14 thus comes to hold every word that has appeared or been referenced at least once in the series of input words (utterances). The word extraction in step S407 is not strictly necessary; if it is omitted, the feature pattern identification described below is performed for all the words stored in the list.
[0197] For each word stored in the temporary storage area 14, the CPU 11 identifies a feature pattern on the basis of the text data received and stored so far and of the results of the morphological and syntactic analysis of step S406 (step S408). The CPU 11 substitutes the feature quantities of the identified feature pattern into a regression equation for calculating reference probabilities, obtained in advance by regression analysis of spoken language, and calculates a reference probability for each word (step S409). The CPU 11 then judges whether reference probabilities have been calculated for all the words stored in the temporary storage area 14 (step S410). When the CPU 11 judges that reference probabilities have not yet been calculated for all the stored words (S410: NO), it returns the processing to step S408 and performs the feature pattern identification and reference probability calculation for another word.
[0198] When the CPU 11 judges that reference probabilities have been calculated for all the stored words (S410: YES), it narrows the words, for each of which a reference probability has been calculated and stored in the temporary storage area 14, down to those whose calculated reference probability is at least a predetermined value (step S411). Removing words with extremely low reference probabilities reduces the load that the subsequent computation places on the CPU 11 itself. The CPU 11 performs the following search processing on the basis of the narrowed-down words and their reference probabilities for the accepted words.
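As a hedged illustration of steps S408 to S411, the sketch below turns each word's feature pattern (dist, gram, chain) into a reference probability and keeps only words above a threshold. The logistic link and every coefficient value are assumptions; the specification states only that a regression equation fitted in advance to spoken-language samples is used.

```python
import math

# Illustrative sketch of steps S408-S411: substitute each word's feature
# quantities into a pre-trained regression equation to obtain its reference
# probability, then keep only sufficiently probable words (step S411).

COEF = {"bias": -1.0, "dist": -0.8, "chain": 1.2, "gram=tte": 0.5}  # hypothetical

def reference_probability(dist: int, gram: str, chain: int) -> float:
    z = (COEF["bias"]
         + COEF["dist"] * dist
         + COEF["chain"] * chain
         + COEF.get(f"gram={gram}", 0.0))   # categorical feature as indicator
    return 1.0 / (1.0 + math.exp(-z))       # assumed logistic form

def weighted_word_group(features: dict[str, tuple], threshold: float = 0.05):
    """Word -> reference probability, dropping improbable words."""
    group = {w: reference_probability(*f) for w, f in features.items()}
    return {w: p for w, p in group.items() if p >= threshold}

# Feature pattern from paragraph [0211]: f = (dist, gram, chain)
print(weighted_word_group({"Okunchi": (3, "tte", 1)}))  # {'Okunchi': 0.15...}
```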
[0199] By the processing so far, a set of words paired with their reference probabilities (a weighted word group) has been generated as a search request for the accepted words; it quantitatively represents the semantic cohesion that the accepted words carry in the flow continuing from the previously accepted words. The search processing below (steps S412 to S416, enclosed by the dash-dotted line) is one example of processing that compares the weighted word group obtained for the accepted words with the weighted word groups stored in advance for each sentence, judges whether the words and a sentence are similar in meaning according to whether the distributions of the weight values of the words in the respective weighted word groups are similar, and extracts the similar sentences.
[0200] The CPU 11 reads, from the database in the storage means 13 or in the document storage means 2, the pairs of words and word reference probabilities (hereinafter, weighted word groups) stored in association with each sentence (step S412).
[0201] At this point, so that it can narrow the reading down to weighted word groups that are similar to some degree, the CPU 11 judges to which group the weighted word group associated with the accepted words, obtained by the processing up to step S411, belongs, in the same way as for the weighted word groups stored in the database. The CPU 11 then reads from the database the weighted word groups belonging to the same group as the weighted word group associated with the accepted words. This avoids comparisons with weighted word groups that are not similar at all, and narrows the extraction down to weighted word groups that are similar to some degree.
[0202] Next, from the weighted word groups read in step S412, the CPU 11 extracts those that contain the same words as the weighted word group of the accepted words (step S413). For each word shared with an extracted group, the CPU 11 calculates the difference between the reference probabilities (step S414). The CPU 11 assigns similarity scores to the extracted weighted word groups, ranking them higher the more words they share and the smaller the differences between the reference probabilities of the shared words (step S415), and reads from the document data of the document set the sentences with which the extracted weighted word groups are associated (step S416). At this point, the CPU 11 may read only the sentences corresponding to weighted word groups whose similarity is at least a predetermined value. The CPU 11 sorts the extracted sentences by similarity (step S417).
[0203] By the processing from step S412 to step S417 described above, sentences can be extracted whose associated weighted word groups have distributions of word weight values similar to the distribution of the weight values of the words in the weighted word group obtained for the accepted words.
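A minimal sketch of the matching in steps S413 to S415, assuming a simple scoring rule: more shared words raise the score, and larger gaps between the shared words' reference probabilities lower it. The specification does not fix an exact scoring formula here, so this combination is illustrative only.

```python
# Illustrative sketch of steps S413-S415: score a stored sentence's weighted
# word group against the query's group.

def similarity(query: dict[str, float], stored: dict[str, float]) -> float:
    shared = query.keys() & stored.keys()
    if not shared:
        return 0.0            # no common word: the group is skipped (step S413)
    diff = sum(abs(query[w] - stored[w]) for w in shared)   # step S414
    return len(shared) / (1.0 + diff)                       # step S415

query = {"Okunchi": 0.15, "Kyushu": 0.24}
sentence = {"Okunchi": 0.12, "Kyushu": 0.20, "Nagasaki": 0.08}
print(similarity(query, sentence))   # about 1.87: two shared words, small gaps
```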
[0204] Next, the CPU 11 transmits text data representing each extracted sentence to the reception device 4 via the communication means 15 as the text data of the search result (step S418).
[0205] The CPU 41 of the reception device 4 receives the text data of the search result via the communication means 48 (step S419), displays the received text data on a monitor or the like via the display means 46 (step S420), and ends the processing.
[0206] Each time it detects the input of words from the user, the CPU 41 of the reception device 4 transmits the text data or speech data, separated into individual sentences, to the sentence unit search device 1. Each time it receives text data, or speech data together with the information transmitted with it, from the reception device 4, the CPU 11 of the sentence unit search device 1 calculates the words and the reference probability of each word, and creates, as a search request for the words accepted from the user, information representing the semantic cohesion that reflects the flow from the preceding words, that is, a weighted word group. The CPU 11 of the sentence unit search device 1 extracts sentence units from the stored document data on the basis of the search request (weighted word group) created for the accepted words, and transmits text data as the search result.
[0207] The CPU 41 of the reception device 4 in Embodiment 1 displays the text data of the search result on a monitor or the like each time it receives such data. Consequently, every time words are input by the user, the reception device 4 displays as search results text data whose semantic cohesion is similar to that of those words.
[0208] The reception device 4 need not necessarily be configured to transmit text data and to receive and display a search result every single time words are input by the user. For example, it may be configured to transmit to the sentence unit search device 1 text data or speech data corresponding to a plurality of words input during a predetermined period, and to receive and display the search results corresponding to those words.
[0209] The details of the processing by the CPU 11 of the sentence unit search device 1 shown in the flowcharts of FIG. 15, FIG. 16, and FIG. 17 are described below with a concrete example.
[0210] FIG. 18 is an explanatory diagram showing an example of the feature patterns that the CPU 11 of the sentence unit search device 1 in Embodiment 1 identified for text data received from the reception device 4. The sentence units s_{i-2}, s_{i-1}, and s_i in FIG. 18 are the sentences represented by the respective pieces of received text data.
[0211] The feature pattern of the sample (s_i, "Okunchi"), for the word "Okunchi" contained in sentence unit s_i of FIG. 18 and in the preceding sentence units, is identified as follows. Among the current sentence s_i and the preceding sentences, the sentence in which the word "Okunchi" most recently appeared or was referenced is s_{i-2}, and the distance feature quantity (dist) between s_i and s_{i-2} is dist = 3. The case particle governing "Okunchi" in s_{i-2}, the sentence where it most recently appeared or was referenced, is "tte", so gram = tte. Further, since the word "Okunchi" appeared or was referenced in the sentence s_{i-2} preceding the sentence s_i, chain = 1. The feature pattern is therefore identified as f(s_i, "Okunchi") = (dist = 3, gram = tte, chain = 1). In the case of English, gram is identified by a preposition.
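For concreteness, the feature pattern of paragraph [0211] can be pictured as a small record type; the sketch below is an editorial illustration, not part of the disclosure, holding the three feature quantities for the sample f(s_i, "Okunchi").

```python
from dataclasses import dataclass

# Editorial illustration of the feature pattern as a record: dist is how far
# back the word last appeared or was referenced, gram the case particle (a
# preposition in English) governing it there, chain whether it had already
# appeared in the preceding context.

@dataclass(frozen=True)
class FeaturePattern:
    dist: int
    gram: str
    chain: int

f_okunchi = FeaturePattern(dist=3, gram="tte", chain=1)   # f(s_i, "Okunchi")
print(f_okunchi)
```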
[0212] In the sentence unit search device 1, a regression equation has been derived in advance for spoken language as well, by performing regression analysis on the document data stored in the document storage means 2, such that a reference probability can be calculated by substituting in the feature quantities once a feature pattern has been identified. The CPU 11 of the sentence unit search device 1 can therefore calculate a reference probability for "Okunchi" in sentence s_i from the feature quantities dist, gram, and chain of the identified feature pattern. Furthermore, the CPU 11 of the sentence unit search device 1 calculates reference probabilities for sentence s_i including the words that appeared or were referenced in the past, obtaining the words and their reference probabilities. On the basis of the obtained words and reference probabilities, the CPU 11 of the sentence unit search device 1 directly extracts, from the sentence units whose salience attributes are stored in advance in the document storage means 2, those sentence units in which the reference probabilities of the same words are at least a predetermined value. The CPU 11 of the sentence unit search device 1 transmits text data representing the extracted sentences to the reception device 4 via the communication means 15.
[0213] Through such processing by the CPU 11 of the sentence unit search device 1, the semantic cohesion of the words represented by the received text data can be expressed, for each set of words, by words and their reference probabilities (weight values). Moreover, since a word group representing semantic cohesion together with the words' reference probabilities (a weighted word group) is also stored for each sentence of the document data held in advance in the document storage means 2, sentences whose semantic cohesion is similar to that of the words accepted from the user can be searched for directly, according to whether the reference probabilities of the extracted words are similar.
[0214] (Embodiment 2)
In Embodiment 2, for each sentence of the document data stored in the document storage means 2 in the pre-processing stage, the pair of the extracted words and the reference probabilities calculated for them (the weighted word group) is treated as a salience vector. Likewise, the pair of words and per-word reference probabilities calculated for the accepted words (the weighted word group) is also treated as a salience vector. In the search processing stage of Embodiment 1, whether the distribution of the weight values of the words in the weighted word group of the accepted words and the distribution of the weight values in the weighted word group associated in advance with each sentence were similar was judged by whether the same words were stored and the differences between the same words' values were small. In Embodiment 2, by contrast, each weighted word group is represented as a salience vector, and whether the similarity condition holds is judged by the shortness of the distance between the salience vectors.
[0215] For the search system using the sentence unit search device 1 according to the present invention in Embodiment 2, "1. Hardware configuration and overview" and "2. Acquisition of document data and natural language analysis" are the same as in Embodiment 1, and their description is therefore omitted. "3. Quantification of the semantic cohesion of each sentence of the document data" and "4. Search processing" are described below, using the same reference numerals as in Embodiment 1; for these, too, detailed description of the points they share with Embodiment 1 is omitted.
[0216] 3. Quantification of the semantic cohesion of each sentence of the document data
3-1. Definition of the semantic cohesion of each sentence
In Embodiment 2, as in Embodiment 1, the information quantitatively representing the semantic cohesion of each sentence is expressed by the group of words the user is attending to when using the sentence (speaking, writing, listening to, or reading it), and by values quantitatively indicating the degree to which the user attends to each word, that is, its salience (word weight values). Also as in Embodiment 1, the reference probability, which indicates the probability that a word will appear or be referenced in subsequent sentences, is used as the weight value quantitatively expressing salience.
[0217] 3-2. Regression model learning
In Embodiment 2 as well, reference probabilities are calculated using a regression equation containing regression coefficients obtained by regression analysis of samples of the document data stored in the document storage means 2, as in "3-2. Regression model learning" of Embodiment 1.
[0218] 3-3. Quantification of salience for each sentence unit
In Embodiment 2 as well, the CPU 11 of the sentence unit search device 1 can calculate a reference probability for each extracted word by identifying the feature quantities dist, gram, and chain and using the regression equation containing the regression coefficients obtained by the regression analysis. This yields a weighted word group in which each word's reference probability is assigned as its weight value. In Embodiment 2, the weighted word group representing the semantic cohesion of each sentence is treated as a salience vector that takes each word as one dimension and holds the reference probability calculated for each word as the element of the dimension component corresponding to that word. That is, the semantic cohesion of a sentence in the document data stored in the document storage means 2 can be represented as a vector in the 31245-dimensional multidimensional space of the words extracted from that document data and stored in the list shown in FIG. 6.
[0219] Accordingly, with respect to the 31245-dimensional basis space formed by the word group (ai, aida, aimai, ..., Z, Z-kun), the salience vector v(s_3) of the sentence s_3 shown in FIG. 11 has, as the element corresponding to the 9714th dimension, "Kyushu", the magnitude of its reference probability (weight value), 0.238, and, as the element corresponding to the 9716th dimension, "northern Kyushu region", the magnitude of its reference probability, 0.1159. It can therefore be expressed and handled as the 31245-dimensional vector (0, 0, ..., 0.238, 0, 0.1159, ..., 0).
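Since only a handful of the 31245 dimensions are nonzero for any one sentence, a sparse mapping from dimension number to reference probability is a natural realization of such a salience vector. The sketch below is an assumption about the data layout, mirroring the dimension numbers quoted for sentence s_3.

```python
# Assumed sparse layout for a salience vector: only the nonzero pairs of
# (dimension number -> reference probability) are kept, which is exactly
# what the salience attribute records.

salience_s3 = {
    9714: 0.238,    # dimension for "Kyushu"
    9716: 0.1159,   # dimension for "northern Kyushu region"
}

def component(vector: dict[int, float], dim: int) -> float:
    """Every dimension not listed is implicitly 0."""
    return vector.get(dim, 0.0)

print(component(salience_s3, 9714), component(salience_s3, 1))   # 0.238 0.0
```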
[0220] The document data that the CPU 11 of the sentence unit search device 1 in Embodiment 2 stores in the document storage means 2 with the calculated reference probabilities attached is the same as the document data shown in the explanatory diagram of FIG. 11 of Embodiment 1. That is, the document data stored in the document storage means 2 holds the dimension numbers and the reference probability values that are the elements of the dimension components. The processing procedure by which the CPU 11 of the sentence unit search device 1 in Embodiment 2 calculates word reference probabilities for each sentence of the tagged document data stored in the document storage means 2 and stores them in the database in association with each sentence is the same as in Embodiment 1, and its description is therefore omitted.
[0221] 4. Search processing
Next, the search processing in Embodiment 2 is described. As for "4-1. Acceptance of words input by the user", the processing performed by the CPU 41 of the reception device 4 is the same as in Embodiment 1.
[0222] 4-2. Quantification of the semantic cohesion of the accepted words
The processing by which the CPU 11 of the sentence unit search device 1, upon receiving text data representing words accepted by the reception device 4, searches the sentences of the documents stored in the document storage means 2 is described. For the text data representing the accepted words as well, the CPU 11 of the sentence unit search device 1 expresses the contextual semantic cohesion of the accepted words as a salience vector indicating a direction in the multidimensional word space.
[0223] As in the processing of Embodiment 1, the CPU 11 of the sentence unit search device 1 identifies, for the text data received from the reception device 4, the feature patterns expressed by the feature quantities dist, gram, and chain for the 31245 dimensions of words stored in the list. For words that have not appeared in the text data received so far as a series, the elements of the corresponding dimension components are set to 0 and the feature pattern identification is omitted.
[0224] From the feature quantities dist, gram, and chain expressing a feature pattern, the reference probabilities serving as the elements of the dimension components can each be calculated on the basis of the regression equation. Therefore, each time it receives text data, the CPU 11 of the sentence unit search device 1 can calculate a salience vector representing the contextual semantic cohesion, up to that point, of the words represented by the received text data.
[0225] The CPU 11 of the sentence unit search device 1 directly calculates, by vector computation, the distance between the salience vector calculated for the accepted words and the salience vectors of the sentences stored in the document storage means 2 with salience attributes attached in advance, and extracts the sentences at short distances. Sentences whose semantic cohesion points in a similar direction can thus be searched for within the 31245-dimensional multidimensional space in which each word of FIG. 6 constitutes one dimension. The CPU 11 of the sentence unit search device 1 transmits text data representing the extracted sentences to the reception device 4 via the communication means 15. When a computer capable of handling vector operations is used, the semantic cohesion of each sentence can be represented as a salience vector and operated on directly.
[0226] The processing procedure by which the CPU 11 of the sentence unit search device 1 receives text data representing the words of a search request from the reception device 4 and searches the document data stored in the document storage means 2 using salience vectors on the basis of the received text data is described. FIG. 19 is a flowchart showing the processing procedure of the search processing of the sentence unit search device 1 and the reception device 4 in Embodiment 2. In the processing procedure shown in the flowchart of FIG. 19, steps identical to those of the search processing shown in the flowcharts of FIG. 15, FIG. 16, and FIG. 17 in Embodiment 1 are given the same reference numerals, and their detailed description is omitted.
[0227] Within the processing procedure shown in the flowchart of FIG. 19, the processing of steps S501 to S506, enclosed by the dash-dotted line, differs from the processing procedure shown in the flowcharts of FIG. 15, FIG. 16, and FIG. 17 in Embodiment 1. The processing from step S501 to step S506, executed by the CPU 11 of the sentence unit search device 1 in Embodiment 2 in place of the processing from step S412 to step S416 of Embodiment 1, is described below.
[0228] The CPU 11 of the sentence unit search device 1 narrows the words, for each of which a reference probability has been calculated and stored in the temporary storage area 14, down to those whose calculated reference probability is at least a predetermined value (step S411), and calculates the salience vector of the accepted words on the basis of the narrowed-down words and their calculated reference probabilities (step S501).
[0229] By the processing up to step S501, a salience vector quantitatively representing, for the accepted words, the semantic cohesion in the flow continuing from the previously accepted words has been generated as a search request. The processing below is one example of processing that compares the salience vector obtained for the accepted words with the salience vectors stored in advance for each sentence and judges whether the distributions of the weight values of the words represented by the respective salience vectors are similar.
[0230] The CPU 11 reads the weighted word groups, that is, the salience vectors, stored in the database (step S502). At this point, it judges to which group the salience vector associated with the accepted words, obtained by the processing up to step S411, belongs, in the same way as for the salience vectors stored in the database. The CPU 11 reads from the database the salience vectors belonging to the same group as the salience vector associated with the accepted words. This narrows the extraction, to some degree, down to salience vectors whose distributions of word weight values are similar.
[0231] The CPU 11 calculates the distance between the salience vector associated with the accepted words and each salience vector that was read (step S503). The CPU 11 narrows the read salience vectors down to those whose calculated distance is less than a predetermined value (step S504), and reads the sentences stored in association with the narrowed-down salience vectors (step S505). The CPU 11 assigns similarity scores to the read sentences, ranking them higher the shorter the calculated distance (step S506).
[0232] Through the processing from step S501 to step S506 by the CPU 11 of the sentence unit search device 1 in Embodiment 2, sentences whose contextual meaning is similar to that of the accepted words are extracted.
[0233] The subsequent processing of the extracted sentences from step S417 onward is the same as in Embodiment 1.
[0234] Within the processing procedure described above, the processing of step S503, in which the CPU 11 calculates the distance between the salience vector associated with the accepted words and each read salience vector, concretely performs the calculation as follows. When the salience vector associated with the accepted words u_i is denoted v(u_i) and a read salience vector is denoted v(s_j), the CPU 11 calculates the cosine distance as shown in Equation (5) below.
[0235] [Equation 5]

$$\mathrm{sim}\bigl(v(u_i),\, v(s_j)\bigr) \;=\; \frac{v(u_i)\cdot v(s_j)}{\lVert v(u_i)\rVert\,\lVert v(s_j)\rVert}$$
[0236] Note that when the distance is calculated as shown in Equation (5), the closer the salience vector v(u_i) of the words and the read salience vector v(s_j) are, the larger the calculated cosine distance value becomes. Accordingly, in step S506 the CPU 11 assigns similarity scores in descending order of the calculated cosine distance.
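A direct transcription of Equation (5) on the sparse representation might look as follows; the function and variable names are illustrative, not part of the disclosure.

```python
import math

# Sketch of Equation (5): the cosine measure between the query's salience
# vector v(u_i) and a stored sentence's salience vector v(s_j), computed
# directly on the sparse (dimension -> weight) representation.

def cosine(v_u: dict[int, float], v_s: dict[int, float]) -> float:
    dot = sum(w * v_s.get(d, 0.0) for d, w in v_u.items())
    norm_u = math.sqrt(sum(w * w for w in v_u.values()))
    norm_s = math.sqrt(sum(w * w for w in v_s.values()))
    if norm_u == 0.0 or norm_s == 0.0:
        return 0.0
    return dot / (norm_u * norm_s)     # larger value = vectors point closer

v_query = {9714: 0.238, 9716: 0.1159}
v_sentence = {9714: 0.20, 9716: 0.10, 120: 0.05}
print(cosine(v_query, v_sentence))     # about 0.98: similar salience
```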
[0237] Through such processing by the CPU 11 of the sentence unit search device 1 and the CPU 41 of the reception device 4, the semantic cohesion of the accepted words can be represented, for each set of words, by a salience vector whose elements are the reference probabilities of the individual words. Moreover, since a salience vector whose elements are the reference probabilities of the words representing its semantic cohesion is also stored for each sentence of the document data held in advance in the document storage means 2, sentences with similar semantic cohesion can be searched for directly by the distance between salience vectors, which express directions in the multidimensional word space.
[0238] (Embodiment 3)
In Embodiments 1 and 2, within the pre-processing that performs "3. Quantification of the semantic cohesion of each sentence unit of the document data", weighted word groups, as pairs of words and word reference probabilities or as salience vectors, were stored in association with each sentence unit. In the subsequent "4. Search processing" as well, within "4-2. Quantification of the semantic cohesion of the accepted words", a weighted word group, as a pair of words and word reference probabilities or as a salience vector, was obtained and associated with the accepted words. In Embodiment 3, by contrast, for the weighted word group (the pair of words and word reference probabilities, or the salience vector) associated with each sentence unit or set of words, processing is executed that recalculates the weight value representing each word's salience so as to take into account associations from other words deeply related to that word.
[0239] Concretely, association means that even when a word in the weighted word group associated with a sentence unit does not appear in that sentence unit or in the preceding sentence units, if a word deeply related to it has high salience, that word too should be receiving attention in that sentence unit. Accordingly, words that tend to receive attention at the same time as a given word is receiving attention are taken as its related words, and the influence of the salience of deeply related words is reflected in the weight value representing each word's salience.
[0240] FIG. 20 is an explanatory diagram outlining the influence of the salience of words deeply related to a given word, as it concerns the search method of the present invention in Embodiment 3. The explanatory diagram of FIG. 20 represents an example of a conversation between one or more users. The conversation is a set of utterances u_1, u_2, u_3, u_4, made in the order u_1, u_2, u_3, u_4.
[0241] Here, "Osaka" does not appear in any of the utterances u_1, u_2, u_3, u_4. "Osaka" did appear in an utterance preceding u_1, so even if the salience of "Osaka" in each of the utterances u_1, u_2, u_3, u_4 was not zero but of some height, "Osaka" has not appeared since; consequently, if the reference probability expressing the salience of "Osaka" is calculated quantitatively at the time of utterance u_4, its value may have fallen.
[0242] However, even though the word "Osaka" has not appeared in the sentence units or words up to that point, the words "Amerika-mura" and "Minami" appear in utterances u_1 and u_3. If reference probabilities are calculated for "Amerika-mura" and "Minami" at the time of utterance u_4, their values should therefore be high. Since both Amerika-mura and Minami are well-known entertainment districts of Osaka, the appearance of "Amerika-mura" or "Minami" should in principle raise the salience of the deeply related word "Osaka", even though the word "Osaka" itself neither appears nor is referenced in utterance u_4. In the example of FIG. 20, therefore, the reference probability expressing the salience of "Osaka" in utterance u_4 ought to have a high value.
[0243] In Embodiment 3, therefore, the weight value representing the salience of each word associated with a sentence unit or with a set of words is recalculated in consideration of the salience of its related words.
[0244] To recalculate the reference probabilities into weight values that take the salience of related words into account, the sentence unit search device 1 must first obtain information representing which words are deeply related to which. The influence of the degree of relatedness, which expresses the depth of the relation, is then reflected in the reference probability of each word calculated for each sentence unit. Concretely, in the example above, the degree of relatedness of "Amerika-mura" to "Osaka" is first calculated quantitatively. The effect of that relatedness to "Osaka" is then applied to the already calculated reference probability of "Amerika-mura", and a weight value representing the salience of "Osaka" in that sentence unit is recalculated and stored.
[0245] In Embodiment 3, therefore, the sentence unit search device 1 first creates, for each word, a weighted related word group in which the degree of relatedness of every word to that one word is assigned as a weight value. Concretely, the sentence unit search device 1 creates the weighted related word group of each word using the weighted word groups, that is, the pairs of words and word reference probabilities or the salience vectors, stored in association with each sentence unit by the processing of "3-3. Quantification of salience for each sentence unit" in Embodiment 1 or 2. The sentence unit search device 1 creates and stores a weighted related word group for each word extracted from the entire document set.
[0246] Next, the sentence unit search device 1 reflects, into the reference probability of each word in the weighted word group stored in association with each sentence unit (the pair of words and word reference probabilities, or the salience vector), the influence of the reference probabilities of the words deeply related to it, using the degrees of relatedness, and recalculates and stores each word's weight value.
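The specification states that the degrees of relatedness are used to fold related words' reference probabilities back into each word's weight, but at this point it does not fix the combining formula. The additive rule in the sketch below is therefore purely an assumed placeholder; it only shows the direction of the computation, with "Osaka" gaining weight from salient related words even though it never appeared itself.

```python
# Assumed placeholder for paragraph [0246]: fold the salience of related
# words back into each word's weight. The additive rule is NOT given by the
# specification; it is for illustration only.

def associate(weights: dict[str, float],
              relatedness: dict[str, dict[str, float]]) -> dict[str, float]:
    adjusted = dict(weights)
    for word, related in relatedness.items():
        for other, degree in related.items():
            # a salient related word raises this word's weight
            adjusted[word] = adjusted.get(word, 0.0) + degree * weights.get(other, 0.0)
    return adjusted

weights = {"Amerika-mura": 0.6, "Minami": 0.3}                  # from the utterances
relatedness = {"Osaka": {"Amerika-mura": 0.5, "Minami": 0.4}}   # hypothetical degrees
print(associate(weights, relatedness))   # "Osaka" gains 0.42 despite never appearing
```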
[0247] Furthermore, in the search processing the sentence unit search device 1 likewise uses the degrees of relatedness to recalculate the weight value of each word in the weighted word group associated with each set of words (the pair of words and word reference probabilities, or the salience vector). The sentence unit search device 1 performs the search processing on the basis of the words corresponding to the accepted words and the weight values recalculated for each word.
[0248] In the following, the processing by which the CPU 11 of the sentence unit search device 1 creates the weighted related word group for each word is described in an added section "3-4. Creation of related word groups". The processing that uses the created related word groups to recalculate the reference probabilities calculated in "3-3. Quantification of salience for each sentence unit" into weight values that take relatedness into account is described in an added section "3-5. Quantification of semantic cohesion with association taken into account". The processing that recalculates the reference probabilities calculated in "4-2. Quantification of the semantic cohesion of the accepted words" into weight values that take relatedness into account and then executes the search is described in a section "4-2'. Quantification of the semantic cohesion of the accepted words with association taken into account".
[0249] For the search system using the sentence unit search device 1 according to the present invention in Embodiment 3, "1. Hardware configuration and overview" and "2. Acquisition of document data and natural language analysis" are the same as in Embodiment 1, and their description is therefore omitted. "3. Quantification of the semantic cohesion of each sentence of the document data" and "4. Search processing" are described below, using the same reference numerals as in Embodiment 1; for these, too, detailed description of the points they share with Embodiment 1 is omitted.

[0250] 3-4. Creation of related word groups
A related word group is created for each single word, for all the words extracted in the explanatory diagram shown in FIG. 6, by the sentence unit search device 1 performing the following processing.
[0251] First, from the weighted word groups stored in association with every sentence unit in "3-3. Quantification of salience for each sentence unit", the sentence unit search device 1 extracts the weighted word groups in which the reference probability of the one word is at least a predetermined value. This is because, as described above, related words are taken to be words that tend to receive attention at the same time as the one word is receiving attention, so sentence units in which the one word is not receiving attention are to be removed.
[0252] Next, the sentence unit search device 1 integrates the weighted word groups, extracted by the above processing, in which the reference probability of the one word is at least the predetermined value. Concretely, the reference probability of each word in each weighted word group is weighted by the reference probability of the one word contained in that weighted word group, and the reference probabilities of each word are then averaged. The weighting by the reference probability of the one word is applied so that the per-word reference probabilities of the weighted word groups in which the one word's reference probability is higher are used more strongly.
[0253] Then, so that the weighted related word groups of all words can be handled uniformly, the weight value of each word in the weighted related word group is normalized.
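Summarizing paragraphs [0251] to [0253], one plausible realization of the related word group construction is sketched below: filter, weight, sum, and normalize. The threshold and the L2 normalization follow the text; the sample group values are chosen so that the pre-normalization totals reproduce the FIG. 23 worked example quoted later in paragraphs [0271] and [0274].

```python
import math

# Plausible realization of "3-4. Creation of related word groups": keep the
# stored weighted word groups in which the target word's reference
# probability reaches the threshold, weight each group by that probability,
# sum per word, and normalize.

def related_word_group(target: str,
                       groups: list[dict[str, float]],
                       threshold: float = 0.2) -> dict[str, float]:
    total: dict[str, float] = {}
    for group in groups:
        t = group.get(target, 0.0)
        if t < threshold:
            continue                   # the target word is not salient here
        for word, prob in group.items():
            total[word] = total.get(word, 0.0) + t * prob    # weight and sum
    norm = math.sqrt(sum(v * v for v in total.values()))     # L2 normalization
    return {w: v / norm for w, v in total.items()} if norm else total

groups = [
    {"Amerika-mura": 0.6, "Osaka": 0.2},
    {"Amerika-mura": 0.3, "Osaka": 0.4, "aki": 0.1},
    {"Amerika-mura": 0.2, "Osaka": 0.2},
]
# Pre-normalization totals: Amerika-mura 0.49, Osaka 0.28, aki 0.03
print(related_word_group("Amerika-mura", groups))
```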
[0254] The processing by which the CPU 11 of the sentence unit search device 1, which implements the sentence unit search method according to the present invention, creates related word groups is described below. FIG. 21 and FIG. 22 are flowcharts showing the processing procedure by which the CPU 11 of the sentence unit search device 1 in Embodiment 3 creates related word groups. The processing shown in the flowcharts of FIG. 21 and FIG. 22 corresponds to: processing that extracts, for one word, the word groups in which its weight value is at least a predetermined value; processing that integrates the weight values of each word of the extracted word groups and creates a related word group in which the result is assigned to each word as its degree of relatedness; processing that stores the group in association with the one word; and processing that executes these steps for every word.
[0255] The CPU 11 of the sentence unit search device 1 selects one word from the list stored in the storage means 13 (step S601). The CPU 11 acquires tagged document data from the document storage means 2 via the document set connection means 16 (step S602). The CPU 11 identifies the <su> tags added to the acquired document data by character string analysis and reads out a sentence unit (step S603). Next, the CPU 11 reads the salience attribute stored within <su> (step S604), and judges whether, in the pairs of words and word reference probabilities (the weighted word group) stored in the salience attribute, the reference probability of the one word selected in step S601 is at least a predetermined value (step S605).
[0256] When the CPU 11 judges that the reference probability is below the predetermined value (or that the selected one word is not associated with the unit) (S605: NO), the CPU 11 returns the processing to step S603, reads the following sentence unit (S603), and performs the processing of steps S604 and S605.
[0257] When the CPU 11 judges that the reference probability is at least the predetermined value (S605: YES), the CPU 11 stores the weighted word group read from the salience attribute in step S604 in the temporary storage area (step S606).
[0258] The CPU 11 judges whether the processing from step S604 to step S606 has been executed for all sentence units of the document data acquired in step S602 (step S607). When the CPU 11 judges that the processing has not been executed for all sentence units (S607: NO), the CPU 11 returns the processing to step S603, reads the following sentence unit (S603), and executes the processing from step S604 to step S606.
[0259] When the CPU 11 judges that the processing has been executed for all sentence units (S607: YES), the CPU 11 judges whether the weighted word groups in which the reference probability of the selected one word is at least the predetermined value have been extracted from all the document data (step S608). When the CPU 11 judges that they have not yet been extracted from all the document data (S608: NO), the CPU 11 returns the processing to step S602, acquires the next document data (S602), and executes the processing from step S603 to step S607.
[0260] When the CPU 11 judges that the weighted word groups in which the reference probability of the selected one word is at least the predetermined value have been extracted from all the document data (S608: YES), the CPU 11 creates the integrated group from the set of weighted word groups extracted by the processing of step S606 and stored in the temporary storage area 14, by calculating for each word the sum of its weight values, each weighted by the reference probability of the one word in the respective group (step S609).
[0261] The CPU 11 normalizes the sum, created in step S609, of the weighted word groups in which the reference probability of the one word is at least the predetermined value, that is, the weight value of each word in the summed weighted word group (step S610).
[0262] The CPU 11 stores the weighted word group normalized in step S610, in which the reference probability of the one word is at least the predetermined value, as a related word group whose weight values serve as degrees of relatedness, in association with the one word selected in step S601, either in the storage means 13 or, via the document set connection means 16, in the document storage means 2 (step S611).
[0263] Next, the CPU 11 of the sentence unit search device 1 judges whether related word groups have been created and stored for all the words in the list stored in the storage means 13 (step S612). When the CPU 11 judges that related word groups have not been created and stored for all the words (S612: NO), the CPU 11 returns the processing to step S601, selects the next word (S601), and executes the processing from step S602 to step S611 for the selected word.
[0264] When the CPU 11 judges that related word groups have been created and stored for all the words (S612: YES), the CPU 11 ends the processing.
[0265] In step S605, rather than simply judging whether the reference probability is at least the predetermined value, the CPU 11 of the sentence unit search device 1 may perform normalization processing such as the following before making the comparison with the predetermined value. For example, the CPU 11 of the sentence unit search device 1 may normalize by dividing each reference probability by the square root of the sum of the squares of all the reference probabilities, so that the sum of the squares of the reference probabilities of the words associated with a sentence unit becomes 1.
[0266] The normalization in step S610 is likewise performed so that the sum of the squares of the weight values of the words becomes 1. For example, the CPU 11 of the sentence unit search device 1 performs the normalization by dividing each weight value by the square root of the sum of the squares of all the weight values.
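The normalization of paragraphs [0265] and [0266] amounts to dividing every value by the square root of the sum of squared values, so that the squares afterwards sum to 1; a minimal sketch:

```python
import math

# Minimal sketch of the normalization of paragraphs [0265]-[0266]: divide
# every weight by the square root of the sum of squared weights, so that
# the squares then sum to 1.

def normalize(weights: dict[str, float]) -> dict[str, float]:
    root = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / root for w, v in weights.items()} if root else weights

w = normalize({"Amerika-mura": 0.49, "Osaka": 0.28, "aki": 0.03})
print(round(sum(v * v for v in w.values()), 6))    # 1.0
```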
[0267] Next, a concrete example is given of the related word group created when the CPU 11 of the sentence unit search device 1 in Embodiment 3 performs the processing shown in the flowcharts of FIG. 21 and FIG. 22 for one word.
[0268] FIG. 23 is an explanatory diagram showing examples of the weighted word groups at each stage of processing when a related word group is created by the CPU 11 of the sentence unit search device 1 in Embodiment 3. The example shown in the explanatory diagram of FIG. 23 is the case where the CPU 11 of the sentence unit search device 1 has extracted the weighted word groups in which the reference probability of the one word "Amerika-mura" is at least the predetermined value (0.2). FIG. 23(a) shows the weighted word groups GW_1, GW_2, GW_3 extracted by the processing of the CPU 11 in step S605 shown in the flowcharts of FIG. 21 and FIG. 22 and stored in the temporary storage area 14. FIG. 23(b) likewise shows the weighted word groups GW_1', GW_2', GW_3' weighted by the reference probability of the one word through the processing of the CPU 11 in step S607. FIG. 23(c) likewise shows the weighted word group GW'' weighted and summed through the processing of the CPU 11 in step S609.
[0269] As shown in FIG. 23(a), the weighted word groups GW_1, GW_2, GW_3 in which the weight value (reference probability) of the one word "Amerika-mura" is at least the predetermined value 0.2 have been extracted.
[0270] In the weighted word groups GW_1', GW_2', GW_3' shown in FIG. 23(b), each word's weight value has been multiplied by the weight value (reference probability) of the one word "Amerika-mura" in the respective weighted word group. Relative to the word groups GW_1, GW_2, GW_3 shown in FIG. 23(a), the weight value of each word in the word groups GW_1', GW_2', GW_3' shown in FIG. 23(b) has been multiplied by the weight value (reference probability) of the one word "Amerika-mura" as follows. For example, since the weight value (reference probability) of Amerika-mura in the weighted word group GW_1 is 0.6, the weight value of each word in GW_1' is weighted by that reference probability, as follows.
[0271] Word group GW1' = (autumn: 0 (0.6 × 0), Amerikamura: 0.36 (0.6 × 0.6), ..., Ursa Major: 0 (0.6 × 0), Osaka: 0.12 (0.6 × 0.2), Oshika: 0 (0.6 × 0), ...)
[0272] In other words, the higher the weight value of the single word "Amerikamura" in a group, the more strongly the weight values of the other words in that group are reflected.
[0273] In the weighted word group GW'' shown in Fig. 23(c), the weight value of each word is obtained by summing, word by word, the weight values that were weighted by the weight value (reference probability) of the single word "Amerikamura" as shown in Fig. 23(b). The weight values of the word group GW'' of Fig. 23(c) are the sums of the word groups GW1', GW2' and GW3' of Fig. 23(b), as follows.
[0274] Word group GW'' = (autumn: 0.03 (= 0 + 0.03 + 0), Amerikamura: 0.49 (= 0.36 + 0.09 + 0.04), ..., Ursa Major: 0 (= 0 + 0 + 0), Osaka: 0.28 (= 0.12 + 0.12 + 0.04), Oshika: 0 (= 0 + 0 + 0), ...)
[0275] The weight values of the weighted word group GW'', integrated by this weighting and summing, are then normalized by the processing of the CPU 11 of the sentence unit search device 1.
[0276] Any normalization method may be used. For example, the CPU 11 of the sentence unit search device 1 may square the weight value of each word, calculate the square root of the sum of the squared values, and divide each word's weight value by that square root, thereby normalizing the weight values of the weighted word group GW''.
[0277] Alternatively, when the integrated weighted word group GW'' is expressed as a relevance vector, i.e. a multidimensional vector in which each word is one dimension and each word's weight value is the component in the corresponding dimension direction, the multidimensional vector may be normalized by dividing each weight value (component) by the norm of the vector. The norm here is not necessarily the Euclidean norm.
[0278] The weighted word group obtained by summing and normalizing in this way is created by the CPU 11 of the sentence unit search device 1 as the related word group of "Amerikamura". The example shown below is one such related word group of the word "Amerikamura". The words are listed in descending order of weight value.
[0279] Related word group ("Amerikamura") = (Amerikamura: 0.647, America: 0.369, Osaka: 0.258, village: 0.159, security camera: 0.139, camera: 0.139, checkout: 0.129, out: 0.129, inside: 0.128, woman: 0.120, man: 0.102, center: 0.098, crime: 0.092, person: 0.087, takoyaki: 0.082, Shinsaibashi: 0.075, Minami: 0.074, police: 0.073, time: 0.071, park: 0.065, Showa: 0.064, this time: 0.063, count: 0.061, Namba: 0.060, Mitsu: 0.060, Land Rover (registered trademark): 0.059, Rover (registered trademark): 0.059, name: 0.059, plan: 0.057, Dotonbori: 0.055, Tachikawa: 0.055, number: 0.054, Nishitetsu: 0.053, sa': 0.052, Ina: 0.050, original sticker: 0.049, sticker: 0.049, Inn Shinsaibashi: 0.049, Midosuji Line: 0.049, ...)
[0280] The above is a related word group of "Amerikamura" actually created using a document set (the GDA-tagged Mainichi Shimbun corpus; see http://www.gsk.or.jp/catalog.html).
[0281] As the above concrete example of the related word group of "Amerikamura" shows, when "Amerikamura" is in focus, the fact that "Osaka" is a related word attracting more attention than other words can be expressed quantitatively by its weight value. The weight value of each word in the related word group can therefore be said to represent the degree of relevance to the single word. In the above concrete example, the degree of relevance of "Amerikamura" to "Osaka" is 0.258.
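As an illustrative summary of steps S605, S607, S609 and S610, the related word group of one word might be assembled as in the following Python sketch, reusing the l2_normalize helper above. The input values mirror the toy numbers of Fig. 23; all names are assumptions for the example.

```python
def related_word_group(anchor: str,
                       sentence_groups: list[dict[str, float]],
                       threshold: float = 0.2) -> dict[str, float]:
    """Build the related word group of `anchor` from per-sentence groups."""
    total: dict[str, float] = {}
    for group in sentence_groups:
        w = group.get(anchor, 0.0)
        if w < threshold:       # S605: keep groups where the anchor is salient
            continue
        for word, prob in group.items():
            # S607: weight by the anchor's probability; S609: sum per word.
            total[word] = total.get(word, 0.0) + w * prob
    return l2_normalize(total)  # S610: normalize the summed weights

# Toy groups modeled on Fig. 23 (anchor probabilities 0.6, 0.3 and 0.2).
gw1 = {"Amerikamura": 0.6, "Osaka": 0.2, "autumn": 0.0}
gw2 = {"Amerikamura": 0.3, "Osaka": 0.4, "autumn": 0.1}
gw3 = {"Amerikamura": 0.2, "Osaka": 0.2, "autumn": 0.0}
print(related_word_group("Amerikamura", [gw1, gw2, gw3]))
# Before normalization the sums are 0.49, 0.28 and 0.03, as in Fig. 23(c).
```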
[0282] Hereinafter, each weight value in the related word group created for a word w_j, i.e. the degree of relevance of the word w_j to a word w_k, is written b_{j,k}. The related word group of one word w_j is written bw_j = (w_1: b_{j,1}, w_2: b_{j,2}, ..., w_N: b_{j,N}). When a related word group is expressed as a relevance vector, it is written bw_j = (b_{j,1}, b_{j,2}, ..., b_{j,N}).
[0283] The CPU 11 of the sentence unit search device 1 repeats the above processing for all the words shown in the explanatory diagram of Fig. 6 to create a related word group for each word, and stores them in the document storage means 2 or in the storage means 13 of the sentence unit search device 1. By creating and storing, for every word appearing in the document set, a related word group in which degrees of relevance have been quantitatively calculated and assigned in this way, the influence of the relevance of related words can be reflected in the weighted word group that represents the semantic coherence of each sentence unit.
[0284] 3-5. Quantifying semantic coherence with association taken into account
Next, the degree of relevance of each word in the created related word groups is reflected in the weighted word group stored for each sentence unit, i.e. in the pairs of words and reference probabilities or in the salience vector. Specifically, the sentence unit search device 1 reads out the reference probability of each word that has already been calculated and stored, and recalculates and stores, as the weight value of a given word, the value obtained by multiplying each word's reference probability by that word's degree of relevance to the given word.
[0285] Fig. 24 is a flowchart showing the processing procedure by which the CPU 11 of the sentence unit search device 1 according to Embodiment 3 recalculates the weight value of each word of the weighted word group stored in association with each sentence unit. The processing shown in the flowchart of Fig. 24 corresponds to the processing of reassigning, using the degrees of relevance, the weight values of the weighted word group associated with each sentence unit.
[0286] The CPU 11 of the sentence unit search device 1 acquires tagged document data from the document storage means 2 via the document set connection means 16 (step S71). The CPU 11 identifies the <su> tags added to the acquired document data by character string analysis and reads out a sentence unit (step S72).
[0287] Next, the CPU 11 reads out the salience attribute stored in the <su> tag (step S73), and recalculates each reference probability of the pairs of words and reference probabilities (the weighted word group) stored in association by the salience attribute into a weight value that takes association into account, using the related word groups (step S74). The CPU 11 stores again, with the salience attribute attached, the weighted word group (salience vector) consisting of each word and the weight value recalculated for it in step S74 (step S75).
[0288] Next, the CPU 11 judges whether the sentence unit read in step S72 is the end of the document data (step S76). Whether the current sentence is the end of the acquired document data can be judged by checking whether another <su> tag follows the <su> </su> pair enclosing the current sentence; if none follows, the end has been reached. If the CPU 11 judges that it is not the end of the document data (S76: NO), the CPU 11 returns the processing to step S72 and continues with the next sentence unit. If the CPU 11 judges that it is the end of the document data (S76: YES), the CPU 11 judges whether the processing of recalculating the weight value of each word of the weighted word group and storing it in association by the salience attribute has been completed for all document data (step S77).
[0289] If the CPU 11 judges that this processing has not yet been completed for all document data (S77: NO), the CPU 11 returns the processing to step S71, acquires other document data, and continues. If the CPU 11 judges that the processing has been completed for all document data (S77: YES), the CPU 11 ends the processing.
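Purely as an illustration of the loop of Fig. 24, the rewriting of the salience attributes might look like the following Python sketch. The specification names only the <su> tag and the salience attribute; the "word:weight" serialization, the exact tag shape and all function names here are assumptions for the example.

```python
import re

# Assumed tag shape: <su salience="word:weight word:weight ...">
SU_RE = re.compile(r'<su salience="([^"]*)">')

def parse_salience(attr: str) -> dict[str, float]:
    """Parse 'word:weight word:weight ...' (assumed format, no spaces in words)."""
    return {w: float(v) for w, v in (item.rsplit(":", 1) for item in attr.split())}

def serialize_salience(weights: dict[str, float]) -> str:
    return " ".join(f"{w}:{v:.4f}" for w, v in weights.items())

def rewrite_document(doc: str, recalc) -> str:
    """Steps S72 to S76: recompute the salience attribute of every <su> tag."""
    def repl(m: re.Match) -> str:
        new = serialize_salience(recalc(parse_salience(m.group(1))))
        return f'<su salience="{new}">'
    return SU_RE.sub(repl, doc)
```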
[0290] The CPU 11 of the sentence unit search device 1 realizes the recalculation of each word's weight value in step S74 by performing the following processing.
[0291] Fig. 25 is a flowchart showing the details of the processing procedure by which the CPU 11 of the sentence unit search device 1 according to Embodiment 3 recalculates the weight value of each word of the weighted word group stored in association with each sentence unit. The processing shown in the flowchart of Fig. 25 corresponds to the processing of multiplying the weight values of the weighted word group by each word's degree of relevance, and of reassigning each word's weight value based on the multiplied values.
[0292] The CPU 11 of the sentence unit search device 1 reads out each word of the weighted word group stored in association by the salience attribute read in step S74 of the flowchart of Fig. 24, together with each word's reference probability, and stores them in the temporary storage area 14 (step S81). The CPU 11 selects one of the words (step S82) and performs the following processing for the weight value of the selected word.
[0293] The CPU 11 reads out the related word groups, in which a degree of relevance is assigned to each word, stored in the storage means 13 or the document storage means 2 (step S83). From the related word group read out for each word, the CPU 11 acquires the degree of relevance from that word to the selected word (step S84). The CPU 11 multiplies the reference probability of each word stored in the temporary storage area 14 by the acquired degree of relevance from that word to the selected word, and calculates the sum (step S85).
[0294] The sum calculated by the CPU 11 in step S85 is, for the selected word, the weight value representing its salience, recalculated with the association by related words taken into account.
[0295] The CPU 11 judges whether the weight values have been recalculated for all the words stored in the temporary storage area 14 in step S81 (step S86). If the CPU 11 judges that the weight values have not yet been recalculated for all the words (S86: NO), it returns the processing to step S82 and executes the recalculation of steps S82 to S85 for the next word. If the CPU 11 judges that the weight values have been recalculated for all the words (S86: YES), it returns the processing to step S75 of the flowchart of Fig. 24.
[0296] The processing of recalculating the weight values by the CPU 11 of the sentence unit search device 1, shown in step S74 of the flowchart of Fig. 24 and in the flowchart of Fig. 25, may also be executed within the processing of Embodiment 1 that calculates the reference probabilities and stores them as weight values representing the salience for each sentence unit. Specifically, a configuration is possible in which the processing of step S74 and of the flowchart of Fig. 25 is executed between steps S306 and S307 of the processing procedure shown in the flowchart of Fig. 9.
[0297] A concrete example is given below of the processing, in the procedure shown in the flowcharts of Figs. 24 and 25, by which the CPU 11 of the sentence unit search device 1 recalculates the reference probability calculated for each word into a weight value that takes association into account.
[0298] For example, when the relevance group created for the word "Amerikamura" is used, the sentence unit search device 1 recalculates the weight value representing the salience of "Osaka" in a certain sentence unit as follows. Assume that the degree of relevance to "Osaka" in the relevance group created for "Amerikamura" is 0.3. Even when the words stored in association with the sentence unit include "Amerikamura" with a reference probability of 0.4 but do not include "Osaka", the CPU 11 of the sentence unit search device 1 multiplies the reference probability 0.4 of "Amerikamura" by the degree of relevance 0.3 from "Amerikamura" to "Osaka", and recalculates the weight value of "Osaka" in that sentence unit as 0.12 instead of 0.
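The per-word recalculation of steps S82 to S85 (equation (6) below) might be sketched in Python as follows; the nested-dictionary relevance table and the self-relevance value of 1.0 are illustrative assumptions, chosen so that the sketch reproduces the 0.4 × 0.3 = 0.12 example above.

```python
def recalc_salience(probs: dict[str, float],
                    relevance: dict[str, dict[str, float]],
                    vocabulary: list[str]) -> dict[str, float]:
    """Equation (6): salience(w_k) = sum over j of b_{j,k} * Pr(w_j)."""
    return {wk: sum(relevance.get(wj, {}).get(wk, 0.0) * p
                    for wj, p in probs.items())
            for wk in vocabulary}

# relevance[wj][wk] holds b_{j,k}; the self-relevance of 1.0 is assumed here.
relevance = {"Amerikamura": {"Amerikamura": 1.0, "Osaka": 0.3}}
probs = {"Amerikamura": 0.4}   # "Osaka" itself does not appear
print(recalc_salience(probs, relevance, ["Amerikamura", "Osaka"]))
# Prints Amerikamura: 0.4 and Osaka: approximately 0.12, as in the example.
```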
[0299] Here, the weight value representing the salience, with contextual association taken into account, of a word w_k in each sentence s_i is written salience(w_k | pre(s_i)), and the reference probability of the word w_k in each sentence s_i is written Pr(w_k | pre(s_i)). When the degree of relevance of a word w_j to the word w_k is reflected, the weight value is recalculated as salience(w_k | pre(s_i)) = b_{j,k} × Pr(w_j | pre(s_i)). Since other words also have degrees of relevance to the word w_k, the sentence unit search device 1 recalculates each word's weight value as in the following equation (6), so that the influence of the degrees of relevance from all the words w_j (j = 1, ..., N) is also reflected.
[0300] [Equation 6]

$$\mathrm{salience}(w_k \mid \mathrm{pre}(s_i)) = \sum_{j=1}^{N} b_{j,k} \times \Pr(w_j \mid \mathrm{pre}(s_i)) \qquad (6)$$
[0301] Accordingly, the CPU 11 of the sentence unit search device 1 recalculates the weight value of each word w_k (k = 1, ..., N) in a sentence unit s_i as in the following equation (7).
[0302] [Equation 7]

$$V(s_i) = \bigl(\mathrm{salience}(w_1 \mid \mathrm{pre}(s_i)),\ \ldots,\ \mathrm{salience}(w_k \mid \mathrm{pre}(s_i)),\ \ldots,\ \mathrm{salience}(w_N \mid \mathrm{pre}(s_i))\bigr) = (bw_1, \ldots, bw_N)\, v(s_i) \qquad (7)$$

where v(s_i) is the vector whose elements are the reference probabilities Pr(w_j | pre(s_i)) and the relevance vectors bw_1, ..., bw_N are arranged as the columns of the transformation matrix.
[0303] The expression in the last line of equation (7) represents the principle by which, when the weighted word group, i.e. the pairs of words and their reference probabilities, is expressed as the salience vector v(s_i) as shown in Embodiment 2, each word's weight value is calculated in the association-weighted salience vector V(s_i), which has salience(w_k | pre(s_i)) as its k-th element.
[0304] Here, bw_1, ..., bw_N are the relevance vectors that express, as vectors over all the words w_1, ..., w_N, the corresponding related word groups.
[0305] When the weighted word group, i.e. the pairs of words and their reference probabilities, is expressed as the multidimensional vector v(s_i), and the related word groups are expressed as the relevance vectors bw_1, ..., bw_N, the processing of recalculating each word's reference probability into an association-weighted weight value as in equation (7) can be interpreted as follows.
[0306] The association-weighted salience vector V(s_i), which has salience(w_k | pre(s_i)) as its k-th element, can be interpreted as the salience vector v(s_i) viewed in the oblique coordinate system whose basis is the relevance vectors bw_1, ..., bw_N. In other words, the association-weighted salience vector V(s_i) can be interpreted as the salience vector v(s_i), whose elements are the raw reference probabilities, rotated toward the axes of related words.
[0307] The oblique coordinate system based on the relevance vectors bw_1, ..., bw_N is a coordinate system in which, with each association-weighted word taken as one dimension, the basis vectors (the vectors of magnitude 1 in each word's dimension direction) are not mutually orthogonal, and the angle between the basis vectors of highly related words is small.
[0308] Multiplying the salience vector whose elements are the reference probabilities by the transformation matrix whose elements are b_{j,k} can thus be interpreted as yielding the salience vector V(s_i) rotated toward the dimension directions of the related words.
[0309] Therefore, when the weighted word group representing the semantic coherence of each sentence is expressed and stored as a salience vector, the CPU 11 of the sentence unit search device 1 can, by rotating (transforming) that salience vector with the relevance vectors, express and store the semantic coherence of each sentence as an association-weighted salience vector.
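In matrix form, this rotation can be sketched with NumPy as follows, assuming an N × N matrix B whose rows are the relevance vectors bw_j, so that the association-weighted vector is Bᵀ v; the numerical values are illustrative only.

```python
import numpy as np

# Rows of B are the relevance vectors bw_j = (b_{j,1}, ..., b_{j,N}).
B = np.array([[1.0, 0.3],   # bw for "Amerikamura" (illustrative values)
              [0.3, 1.0]])  # bw for "Osaka"
v = np.array([0.4, 0.0])    # reference probabilities of one sentence unit

V = B.T @ v                 # k-th entry: sum over j of b_{j,k} * Pr(w_j)
print(V)                    # -> [0.4  0.12], matching the worked example
```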
[0310] Next, a concrete example is shown of the result of recalculating, with association taken into account and using the relevance groups that quantitatively express degrees of relevance as described above, the weight value of each word representing the semantic coherence of each sentence unit. Fig. 26 is an explanatory diagram showing example contents of the weight values representing the salience of each word calculated by the CPU 11 of the sentence unit search device 1 according to Embodiment 3. The weight values of each word for the sentences s_1 and s_2 shown in Fig. 26(a) are the reference probability values before association is applied using the related word groups. The weight values of each word for the sentences s_1 and s_2 shown in Fig. 26(b) are the weight values after association has been applied using the related word groups.
[0311] The concrete example shown in Fig. 26 uses sentence units extracted from the Corpus of Spontaneous Japanese (http://www.kokken.go.jp/katsudo/kenkyujyo/corpus/CSJ/vol17/D03F0040).
[0312] As the example of Fig. 26 shows, the weight value of "Osaka" in sentence s_1 of Fig. 26(b) is 0.6229, higher than the reference probability value 0.3338 of "Osaka" in sentence s_1 of Fig. 26(a). Likewise, the weight value of "Osaka" in sentence s_2 of Fig. 26(b) is 0.6675, higher still than the reference probability value 0.3208 in sentence s_2 of Fig. 26(a).
[0313] Furthermore, in the reference probability example of Fig. 26(a), the weight value of "Osaka" in sentence s_2 has dropped even though "Amerikamura" appears in s_2, because the influence (excitation) of "Amerikamura" on the weight value of "Osaka" is not considered. In contrast, in the association-weighted example of Fig. 26(b), the weight value representing the salience of "Osaka", which does not itself appear, is raised in sentence s_2 precisely because "Amerikamura" appears there; the influence of the degree of relevance between "Amerikamura" and "Osaka" is reflected.
[0314] In this way, by applying association to the weighted word groups that the sentence unit search device 1 stores for each sentence unit, through related word groups that express degrees of relevance with the quantitative values of reference probabilities, the salience of "Osaka" when "Amerikamura" is in focus in a sentence unit can be brought closer to the background context of the writer or speaker of the sentence unit or words. This avoids the situation in which the weight value representing the salience of the word "Osaka" is calculated low and the semantic coherence of the sentence unit is quantitatively evaluated as if it were detached from the writer's or speaker's actual context.
[0315] 4. Search processing
Next, the search processing in Embodiment 3 is described. As for "4-1. Accepting words input by the user", the processing performed by the CPU 41 of the reception device 4 is the same as in Embodiments 1 and 2, and a detailed description is omitted.
[0316] 4-2'. Quantifying, with association, the semantic coherence of accepted words
Next, the processing is described in which the CPU 11 of the sentence unit search device 1, when it receives the data of words accepted by the reception devices 4, 4, ..., searches the sentences in the documents stored in the document storage means 2. For the accepted words as well, the semantic coherence is quantified: words are extracted from the text data and their reference probabilities are calculated, and the weight values are then recalculated using the degrees of relevance.
[0317] In Embodiment 3, the CPU 11 of the sentence unit search device 1 applies association by related words to the weighted word group, i.e. the pairs of words and reference probabilities or the salience vector, that quantitatively represents the semantic coherence of the accepted words. The processing by which the CPU 11 recalculates, with association, the weight values of the weighted word group associated with the accepted words and executes the search based on the recalculated weight values is described below.
[0318] Fig. 27 is a flowchart showing the processing procedure of the search processing of the sentence unit search device 1 and the reception device 4 in Embodiment 3. In the processing procedure shown in the flowchart of Fig. 27, the same reference symbols are used for the steps identical to the search processing shown in the flowcharts of Figs. 15, 16 and 17 in Embodiment 1, and detailed descriptions of them are omitted.
[0319] In the processing procedure shown in the flowchart of Fig. 27, the processing of step S4001, enclosed by the two-dot chain line, differs from the procedure shown in the flowcharts of Figs. 15, 16 and 17 in Embodiment 1. That is, the difference is that step S4001, described below, is added between step S411 and step S412.
[0320] The search processing of Embodiment 3, which associates with accepted words a weighted word group representing their semantic coherence and extracts the sentence units whose pre-stored semantic coherence is similar, is described below.
[0321] The CPU 11 narrows all the words whose reference probabilities have been calculated and stored in the temporary storage area 14 down to the words whose calculated reference probability is at or above a predetermined value (step S411), and recalculates the reference probabilities calculated in step S408 into association-weighted weight values (step S4001). The recalculation of the association-weighted weight values by the CPU 11 in step S4001 is performed, as in the processing shown in the flowchart of Fig. 25, by selecting the words one at a time and, for the selected word, multiplying each word's reference probability by its degree of relevance to the selected word and summing.
[0322] Through the processing up to this point, a set of words and reference probabilities (a weighted word group) that quantitatively represents, with association taken into account, the semantic coherence of the accepted words in the flow continuing from previously accepted words has been generated as a search request.
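On the query side, steps S411 and S4001 might chain as in the short Python sketch below, reusing the recalc_salience helper sketched earlier; the names and the threshold value are assumptions for the example.

```python
def build_search_request(query_probs: dict[str, float],
                         relevance: dict[str, dict[str, float]],
                         vocabulary: list[str],
                         threshold: float = 0.2) -> dict[str, float]:
    """S411: drop low-probability words; S4001: apply association."""
    kept = {w: p for w, p in query_probs.items() if p >= threshold}
    return recalc_salience(kept, relevance, vocabulary)
```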
[0323] The CPU 11 thereafter reads out the association-weighted word groups stored in association with each sentence and executes the processing of extracting sentences similar to the association-weighted word group obtained in step S4001. The subsequent processing for the association-weighted word group is the same as in Embodiment 1, and a detailed description is omitted.
[0324] Thereby, the sentence unit search device 1 judges whether the semantic coherence, with association applied via related words, is similar between the sentences separated from the document data stored in the document storage means 2 and the accepted words, and can directly output the sentences judged to be similar. Therefore, by implementing the sentence unit search method of the present invention, sentence units whose contextual semantic coherence is similar can be effectively extracted, with association taken into account, and directly output.
[0325] When the CPU 11 of the sentence unit search device 1 associates a weighted word group with the accepted words and judges whether it is similar to the weighted word groups stored in advance for each sentence, the judgment is not necessarily made, as in the processing procedure shown in the flowchart of Fig. 27, according to whether the weighted word groups contain the same words. Nor is it necessarily made by calculating the differences between the weight values assigned to the same words and judging that the smaller the calculated difference, the more similar the groups are.
[0326] Next, the case is described in which the CPU 11 of the sentence unit search device 1 realizes the processing of extracting sentence units whose semantic coherence is similar to the accepted words by expressing semantic coherence with salience vectors and relevance vectors and calculating the distance between vectors.
[0327] Fig. 28 is a flowchart showing the processing procedure of the search processing of the sentence unit search device 1 and the reception device 4 when the vector representation of Embodiment 3 is used. In the processing procedure shown in the flowchart of Fig. 28, the same reference symbols are used for the steps identical to the search processing shown in the flowcharts of Figs. 15, 16 and 17 in Embodiment 1 and in the flowchart of Fig. 19 in Embodiment 2, and detailed descriptions of them are omitted.
[0328] In the processing procedure shown in the flowchart of Fig. 28, the processing from step S501 to step S506 enclosed by the one-dot chain line differs from the procedure shown in the flowcharts of Figs. 15, 16 and 17 in Embodiment 1. Instead of the processing from step S412 to step S416 in Embodiment 1, processing similar to that from step S501 to step S506 executed by the CPU 11 of the sentence unit search device 1 in Embodiment 2 is performed. Furthermore, the processing of step S5001, enclosed by the two-dot chain line, differs from the procedure shown in the flowchart of Fig. 19 in Embodiment 2. That is, the difference is that step S5001, described below, is added between step S501 and step S502.
[0329] The CPU 11 of the sentence unit search device 1 recalculates the salience vector calculated in step S501 into a salience vector with association by related words applied (step S5001).
[0330] The CPU 11 thereafter reads out the association-weighted salience vectors stored in association with each sentence and executes the processing of extracting sentences similar to the association-weighted vector obtained in step S5001. The processing of reading out the association-weighted salience vectors and extracting similar sentences is the same as in Embodiment 2, and a detailed description is omitted.
[0331] In step S5001, the CPU 11 recalculates the salience vector into a salience vector with association by related words applied by transforming (rotating) the salience vector calculated in step S501 with the group (matrix) of relevance vectors, as shown in equation (7). Specifically, the association-weighted salience vector V described above is calculated from the multidimensional vector v whose elements are only the reference probabilities.
[0332] In the processing procedure shown in the flowchart of Fig. 28 described above, the processing of step S503, in which the CPU 11 calculates the distance between the salience vector associated with the accepted words and a read-out salience vector, is concretely performed in Embodiment 3 as follows. When the salience vector recalculated with association for the accepted words u_i is written V(u_i), and the read-out salience vector to which association has been applied in advance is written V(s_i), the CPU 11 calculates the cosine distance as in the following equation (8).
[0333] [Equation 8]

$$\frac{V(s_i) \cdot V(u_i)}{\|V(s_i)\|\,\|V(u_i)\|} = \frac{\displaystyle\sum_{k=1}^{N} \mathrm{salience}(w_k \mid s_i)\,\mathrm{salience}(w_k \mid u_i)}{\sqrt{\displaystyle\sum_{k=1}^{N} \mathrm{salience}(w_k \mid s_i)^2}\;\sqrt{\displaystyle\sum_{k=1}^{N} \mathrm{salience}(w_k \mid u_i)^2}} \qquad (8)$$
[0334] When the distance is calculated as in equation (8), the closer the salience vector V(u_i) of the words and the read-out salience vector V(s_i) are, the larger the calculated cosine value becomes. Therefore, in step S506, the CPU 11 assigns degrees of similarity in descending order of the calculated cosine distance.
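The cosine measure of equation (8) and the descending-order ranking of step S506 might be sketched in Python as follows; the dictionary-based vectors and the "salience" key are assumptions for the example.

```python
import math

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Equation (8): cosine between two association-weighted salience vectors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_sentence_units(query: dict[str, float], units: list[dict]) -> list[dict]:
    """Step S506: a larger cosine means more similar, so sort in descending order."""
    return sorted(units, key=lambda u: cosine(query, u["salience"]), reverse=True)
```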
[0335] Through the processing by the CPU 11 of the sentence unit search device 1 described above, sentence units with similar semantic coherence can be searched for directly by the distance between association-weighted salience vectors representing semantic coherence. By using the vector representation, the CPU 11 can judge directly, with association applied, whether the accepted words and a sentence are similar, without comparing, one word at a time, the weight values of the association-weighted word group associated with the accepted words and those of the association-weighted word group stored in advance in association with the sentence.
[0336] Furthermore, with the sentence unit search device 1 according to Embodiment 3, the salience vectors associated with each sentence unit and with words are handled in an oblique coordinate system in which the dimensions corresponding to the words are not mutually orthogonal and the angle between the dimension directions of highly related words is small. Consequently, when the distances between vectors are compared to judge similarity, vectors that have components in the dimension directions of highly related words are judged to be similar.
[0337] Therefore, when a sentence unit s in which the salience of "Osaka" is high is stored, and in the accepted words the salience of, for example, "Orandamura" ("Holland Village") is high, the sentence unit s is not judged to be similar to the accepted words. When, however, the salience of "Amerikamura" in the accepted words is high, the salience of "Osaka" in the accepted words is excited and raised, so the sentence unit s is more likely to be judged similar to those accepted words.
[0338] This makes it possible to search for, and directly output, sentence units whose semantic coherence is similar to the accepted words more effectively, with association taken into account.
[0339] In Embodiments 1 to 3, the text data received as search results is displayed on the monitor or the like of the display means 46 provided in the reception device 4; however, the received text data may also be converted into speech and output through the speaker or the like of the voice input/output means 47. This allows the user to obtain, as search results, sentences whose semantic coherence is similar to the context of a conversation, either from multiple words the user has input by voice or by voice-inputting a conversation with another user. When the accepted words consist of spoken language, sentences with similar word salience, including words that are omitted in the utterance and represented by zero pronouns, can be obtained directly as search results.
[0340] Each time it receives text data of words, the CPU 11 of the sentence unit search device 1 may also be configured to transmit to the reception devices 4, 4, ... only the text data representing the sentence with the highest priority among the sentences retrieved for that text data. This makes it possible to present the search result for the input words as the utterance of a third party in the conversation, realizing a three-way conversation.
[0341] In Embodiments 1 to 3, the sentence unit search device 1 specifies and stores the information indicating salience for each sentence; however, a configuration is also possible in which each paragraph consisting of multiple sentences is enclosed by the tags <p> </p>, the feature patterns are specified for that paragraph, the information indicating salience is stored by the salience attribute, and paragraphs are output as search results. The unit is not limited to sentences or paragraphs and may be a clause, as long as it is a unit representing a certain coherence of meaning. In spoken language, the character string identifiable as one sentence can become very long: it may consist of many clauses, and although clause follows clause via connective particles such as "-mo" and "-node", the context changes dynamically, so that a single sentence may not form one coherent unit of meaning. Therefore, for a sentence consisting of more than a predetermined number of clauses, a configuration may be adopted in which each clause is treated as one sentence for processing.
[0342] In Embodiments 1 to 3, document data consisting of spoken language is stored in advance separately from document data consisting of written language; however, a configuration is also possible in which received words are stored in the document storage means 2 each time the feature pattern of each word is specified and the reference probability calculated for them. In this case, the CPU 11 of the sentence unit search device 1 can judge whether consecutively received words form one series based on information identifying the reception device 4 that transmitted the words and on information indicating that the reception device 4 has detected the user's search start and finish operations. This allows words to be stored in the document storage means 2 in units corresponding to the pages of the document data stored in advance in the document storage means 2.
[0343] In Embodiments 1 to 3, the sentence unit search device 1 performs everything: the acquisition and tagging of document data, the regression analysis for obtaining the reference probabilities, and the processing when words are accepted. However, a configuration divided into a sentence unit search device and a document storage device is also possible. In this case, the document storage device performs Web crawling to acquire document data, adds tags to the text data by morphological analysis and syntactic analysis, and stores it. In addition, a formula for calculating the reference probability is obtained by regression analysis based on the document data stored in the document storage device, and using the obtained formula, the processing of storing the words of each sentence and their reference probabilities for the stored document data is performed in advance. The sentence unit search device specifies the feature patterns when it receives text data converted from words, acquires the regression formula for calculating the reference probabilities from the document storage device, calculates the reference probabilities, and performs the search.
[0344] In Embodiments 1 to 3, the input of words from the user, such as character string input or voice input, is converted into text data by the reception device 4 and transmitted to the sentence unit search device 1. The invention is not limited to this: the sentence unit search device 1 may itself include input/output means for accepting the user's character string input operations and voice input means for accepting the user's voice input. Fig. 29 is a block diagram showing the configuration when the sentence unit search method of the present invention is implemented by the sentence unit search device 1 alone. In this case, the sentence unit search device 1 further includes, in addition to the CPU 11, the internal bus 12, the storage means 13, the temporary storage area 14, the document set connection means 16 and the auxiliary storage means 17, operation means 145 such as a mouse or keyboard for accepting user operations, display means 146 such as a monitor, and voice input/output means 147 such as a microphone and speaker.
[0345] With the configuration shown in the block diagram of Fig. 29, the CPU 11 of the sentence unit search device 1 can detect the frequency, speaking rate and other features of the voice input from the voice input means and specify the feature pattern of each word in the utterance. The grammatical feature pattern of each word may be obtained by converting the input voice into text data by speech recognition and performing the search based on that text data.
[0346] In Embodiments 1 to 3, the reception devices 4, 4, ... are configured only as devices that divide the accepted character strings or spoken words into fixed lengths, convert them into digital data, and transmit them. However, in order to implement the sentence unit search method of the present invention, the programs stored in the storage means 43 of the reception devices 4, 4, ... may be configured so that the reception devices 4, 4, ... can execute natural language analysis, such as morphological and syntactic analysis or phoneme analysis, on the accepted words. In this case, the CPU 41 of the reception devices 4, 4, ... may calculate the weight values representing the salience of each word in the accepted words and transmit the calculated weighted word group to the sentence unit search device 1 as the search request.
Industrial Applicability
[0347] By causing a computer device capable of speech recognition of conversations between users to implement the sentence unit search method according to the present invention, the method can also be applied to uses in which the computer device participates in a conversation between users to realize a three-way conversation. It is also applicable to uses realizing a conversation-linked advertisement presentation service that switches according to the flow of context of a conversation or chat between users. Application to a meeting support service that presents similar, related minutes from past minutes according to the flow of context during a meeting is also possible. Furthermore, application to a writing support service that accepts text being written as words and provides related information according to the flow of context is also possible.

Claims

[1] A sentence unit search method in which a document set storing a plurality of document data consisting of natural language is used, document data acquired from the document set is separated in advance into sentence units each consisting of one or more sentences, words are accepted, and the sentence units separated from the document set are searched based on the accepted words, the method comprising:
a step of storing in advance, in association with each of the successive sentence units in the document data, a weighted word group consisting of a plurality of words to which weight values in that sentence unit are assigned;
a step of, when words are accepted, associating with the words a weighted word group consisting of a plurality of words to which weight values in the words are assigned;
a similar sentence unit extraction step of extracting, from the document set, sentence units recorded in association with a weighted word group similar to the weighted word group associated with the accepted words; and
a step of outputting the extracted sentence units.
[2] The sentence unit search method according to claim 1, wherein the similar sentence unit extraction step comprises:
a step of judging whether the distribution of the weight values of the plurality of words in the weighted word group associated with the accepted words and the distribution of the weight values of the plurality of words in a weighted word group associated with a previously separated sentence unit satisfy a predetermined condition; and
a step of extracting the sentence units associated with a weighted word group judged to satisfy the predetermined condition.
[3] The sentence unit search method according to claim 1 or 2, wherein the similar sentence unit extraction step includes:
a step of extracting, from the previously sorted sentence units, sentence units associated with a word group containing the same words as the weighted word group associated with the accepted words;
a step of calculating, between the accepted words and each extracted sentence unit, the difference in weight value for each identical word in the associated word groups; and
a step of assigning priorities to the extracted sentence units in ascending order of the calculated difference,
and wherein the extracted sentence units are output on the basis of the priorities.
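A sketch of the ranking recited in claim 3, under the assumption that candidate units share at least one word with the query and that per-word weight differences are summed; all names and values are hypothetical:

```python
# Hypothetical sketch of claim 3: filter to units sharing words with the query,
# then rank by the summed per-word weight difference (smaller = higher priority).

def weight_difference(query: dict, unit_weights: dict) -> float:
    shared = set(query) & set(unit_weights)
    return sum(abs(query[w] - unit_weights[w]) for w in shared)

def rank_by_difference(query: dict, units: list) -> list:
    candidates = [u for u in units if set(query) & set(u[1])]  # same-word filter
    return sorted(candidates, key=lambda u: weight_difference(query, u[1]))

units = [("unit on the economy", {"economy": 0.7, "policy": 0.2}),
         ("unit on sports", {"sports": 0.9})]
print(rank_by_difference({"economy": 0.6}, units))  # only the economy unit qualifies
```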
[4] The sentence unit search method according to claim 1 or 2, comprising a step of calculating each weighted word group as a multidimensional vector that treats each word as one dimension and has the weight value assigned to each word as the element in the dimension corresponding to that word,
wherein the similar sentence unit extraction step includes:
a step of calculating the distance between the multidimensional vector stored for each sorted sentence unit and the multidimensional vector associated with the accepted words; and
a step of assigning priorities to the sentence units in ascending order of the calculated distance,
and wherein the sentence units are output according to the assigned priorities.
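The vector formulation of claim 4 can be sketched as follows; the fixed vocabulary and the choice of Euclidean distance are assumptions of the sketch (the claim requires only a distance between multidimensional vectors):

```python
# Hypothetical sketch of claim 4: one dimension per word, weight values as
# vector elements, ranking by distance to the query vector (shortest first).
import math

vocabulary = ["president", "economy", "country", "measure"]  # hypothetical

def to_vector(weights: dict) -> list:
    return [weights.get(w, 0.0) for w in vocabulary]

def euclidean(u: list, v: list) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

query_vec = to_vector({"president": 0.9, "economy": 0.4})
unit_vec = to_vector({"president": 0.8, "economy": 0.6})
print(euclidean(query_vec, unit_vec))  # smaller distance = higher priority
```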
[5] The sentence unit search method according to any one of claims 1 to 4, comprising, when associating a weighted word group with a sentence unit or with accepted words, a reference probability calculating step of calculating, for each word, a reference probability that the word appears or is referenced in a sentence unit or words subsequent to that sentence unit or those words,
wherein the calculated reference probability is assigned as the weight value of each word.
[6] The sentence unit search method according to claim 5, wherein the reference probability calculating step includes:
a step of identifying a feature pattern including a pattern in which each word appears in a plurality of sentence units including a preceding sentence unit, or a pattern in which the word is referenced from a preceding sentence unit; and
a step of calculating the ratio at which words for which the same feature pattern as said feature pattern is identified appear or are referenced in subsequent sentence units in the document data acquired from the document set,
and wherein the calculated ratio is used as the reference probability.
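A sketch of claim 6, where the reference probability of a feature pattern is the observed ratio with which words bearing that pattern reappear (or are referenced) in subsequent sentence units; the pattern encoding and counts below are hypothetical:

```python
# Hypothetical sketch of claim 6: empirical reference probability per pattern.
from collections import defaultdict

pattern_counts = defaultdict(lambda: [0, 0])  # pattern -> [reappeared, total]

# (feature_pattern, reappeared_in_subsequent_unit) observations, toy values
observations = [
    (("subject", "noun", 1), True),
    (("subject", "noun", 1), False),
    (("object", "noun", 3), False),
]
for pattern, reappeared in observations:
    pattern_counts[pattern][1] += 1
    if reappeared:
        pattern_counts[pattern][0] += 1

def reference_probability(pattern) -> float:
    hit, total = pattern_counts.get(pattern, (0, 0))
    return hit / total if total else 0.0

print(reference_probability(("subject", "noun", 1)))  # 0.5
```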
[7] The sentence unit search method according to claim 5, comprising:
an identification step of identifying, for each word extracted from the document set, the feature pattern of the word;
a determination step of determining whether a word for which the same feature pattern as the identified feature pattern is identified appeared or was referenced in a subsequent sentence unit in the document data; and
a regression step of performing a regression analysis between the identified feature patterns and the results determined for the words identified by those feature patterns, thereby calculating regression coefficients of the feature patterns with respect to the reference probability,
wherein, when storing a weighted word group in association with a sentence unit or when associating a weighted word group with accepted words, the reference probability calculating step identifies, for each sentence unit or each set of accepted words, the feature pattern of each word in that sentence unit or those words, and calculates the reference probability using the regression coefficient for the identified feature pattern.
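Claim 7 requires only that a regression analysis relate feature patterns to the observed outcome; as one plausible instance, here is a sketch using logistic regression trained by stochastic gradient ascent (the feature encoding, learning rate, and data are all assumptions of this sketch):

```python
# Hypothetical sketch of claim 7: learn regression coefficients from
# (feature pattern, reappeared-later) samples, then score new words.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train(samples: list, n_features: int, lr: float = 0.5, epochs: int = 200) -> list:
    """samples: list of (feature_vector, outcome 0/1)."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in samples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

# toy encoding of a feature pattern: [is_subject, is_topic, recency]
samples = [([1, 1, 1.0], 1), ([0, 1, 0.5], 1), ([0, 0, 0.2], 0), ([1, 0, 0.8], 0)]
coef = train(samples, 3)
# reference probability for a newly observed feature pattern
print(sigmoid(sum(c * x for c, x in zip(coef, [1, 1, 0.9]))))
```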
[8] The sentence unit search method according to claim 6, wherein, for sentence units, the ratio is calculated in document data acquired from a first document set consisting of written language, and, for accepted words, the ratio is calculated in document data acquired from a second document set consisting of spoken language.
[9] The sentence unit search method according to claim 7, wherein the identification step, the determination step, and the regression step are executed in advance for each of a first document set consisting of written language and a second document set consisting of spoken language, and
the reference probability calculating step calculates the reference probability for the feature pattern of a word identified in a sentence unit using the regression coefficients calculated by the regression step executed on the first document set, and calculates the reference probability for the feature pattern of a word identified in the accepted words using the regression coefficients calculated by the regression step executed on the second document set.
[10] The sentence unit search method according to any one of claims 6 to 9, wherein the feature pattern is identified by information including one or more of:
the number of sentence units or sets of words from the preceding sentence unit or words, when the word is referenced from a preceding sentence unit or words, to the sentence unit or words containing the word;
the dependency information of the word in the most recent preceding sentence unit or words in which the word appears or is referenced;
the number of times the word has appeared or been referenced up to the sentence unit or words containing the word;
the noun classification of the word in the most recent preceding sentence unit or words in which the word appears or is referenced;
whether the word is the topic in the most recent preceding sentence unit or words in which the word appears or is referenced;
whether the word is the subject in the most recent preceding sentence unit or words in which the word appears or is referenced;
the grammatical person in the sentence unit or words containing the word; and
the part-of-speech information in the sentence unit or words containing the word.
[11] The sentence unit search method according to any one of claims 6 to 10, wherein the feature pattern is identified by information including one or more of:
the time corresponding to the span from the preceding sentence unit or words, when the word is referenced from a preceding sentence unit or words, to the sentence unit or words containing the word;
the utterance speed corresponding to the word in the most recent preceding sentence unit or words in which the word appears or is referenced; and
the frequency of the speech corresponding to the word in the most recent preceding sentence unit or words in which the word appears or is referenced.
[12] The sentence unit search method according to any one of claims 1 to 11, comprising:
a first step of extracting, for one word among the words extracted from the document set, from the weighted word groups associated with the sorted sentence units, word groups that contain the one word and in which the weight value of the one word is equal to or greater than a predetermined value;
a second step of creating a related word group in which a value obtained by integrating, word by word, the weight values of the words in the word groups extracted in the first step is assigned as the degree of relevance of the one word to each word;
a third step of storing the created related word group in association with the one word;
a step of executing the first to third steps in advance for each of the extracted words; and
a relevance adding step of reassigning the weight value of each word of the weighted word group associated with each sentence unit or each set of accepted words, using the degrees of relevance of the words in the related word group stored in association with each word.
[13] The sentence unit search method according to claim 12, wherein the second step includes:
a step of calculating, for the extracted word groups, a sum of the weight values of the words contained in each word group, weighted by the weight value of the one word;
a step of averaging the calculated sums; and
a step of assigning the averaged sum of the weight values of each word as the degree of relevance of that word in the related word group to be created.
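Claims 12 and 13 together can be sketched as follows: collect the weighted word groups in which the target word is strongly weighted, sum each co-occurring word's weight scaled by the target's own weight, and average. The threshold and data are hypothetical:

```python
# Hypothetical sketch of claims 12-13: building a related word group.
from collections import defaultdict

def related_word_group(target: str, unit_groups: list, threshold: float = 0.5) -> dict:
    selected = [g for g in unit_groups if g.get(target, 0.0) >= threshold]
    sums = defaultdict(float)
    for g in selected:
        for word, weight in g.items():
            sums[word] += g[target] * weight  # weighted by the target's own weight
    n = len(selected)
    return {w: s / n for w, s in sums.items()} if n else {}

groups = [{"economy": 0.8, "policy": 0.6},
          {"economy": 0.6, "trade": 0.4},
          {"economy": 0.2, "sports": 0.9}]  # last one falls below the threshold
print(related_word_group("economy", groups))
```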
[14] The sentence unit search method according to claim 12 or 13, wherein the relevance adding step includes, for each word of the weighted word group associated with each sentence unit or each set of accepted words:
a step of multiplying the weight value of each word of the weighted word group by the degree of relevance of each word included in the related word group stored in association with that word; and
a step of reassigning the weight value of each word of the weighted word group on the basis of the multiplication results.
[15] The sentence unit search method according to any one of claims 12 to 14, comprising a step of calculating the related word group for each word as a multidimensional relevance vector that treats each word as one dimension and has the magnitude of the degree of relevance assigned to each word as the element in the dimension corresponding to that word,
wherein the relevance adding step transforms the multidimensional vector stored for each sorted sentence unit by the matrix whose columns are the relevance vectors of the words.
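The matrix formulation of claim 15 amounts to multiplying each sentence-unit vector by a matrix whose columns are the per-word relevance vectors, so weight placed on a word spreads onto its related words. The vocabulary and matrix values below are hypothetical:

```python
# Hypothetical sketch of claim 15: transform a unit vector by the matrix of
# relevance-vector columns.
vocab = ["economy", "policy", "trade"]
# relevance_matrix[i][j]: relevance of vocab[i] within the relevance vector
# of vocab[j] (column j is the relevance vector of word j)
relevance_matrix = [
    [1.0, 0.7, 0.5],
    [0.7, 1.0, 0.2],
    [0.5, 0.2, 1.0],
]

def transform(unit_vector: list) -> list:
    n = len(vocab)
    return [sum(relevance_matrix[i][j] * unit_vector[j] for j in range(n))
            for i in range(n)]

print(transform([0.8, 0.0, 0.3]))  # weight on "economy" spreads to related words
```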
[16] A sentence unit search method that uses a document set in which a plurality of document data composed of natural language are stored, accepts words, and searches the document set on the basis of the accepted words, the method comprising:
a step of sorting document data obtained from the document set into sentence units each consisting of one or more sentences;
a step of extracting, for each sorted sentence unit, words that appear in the sentence unit or words that reference a preceding sentence unit in the document data;
a step of identifying and storing, for each word extracted for the sentence units, the features of the word in each sentence unit;
a step of identifying, for each sorted sentence unit, a feature pattern including a pattern of the combination of the features when a word extracted for the sentence unit appears in the sentence unit and in preceding sentence units, or a pattern of reference when the word is referenced from a preceding sentence unit;
a step of storing the identified feature pattern and whether the word identified by the feature pattern appeared or was referenced in a subsequent sentence unit;
a step of executing regression learning in which, over all sentence units in the documents obtained from the document set, a regression analysis of the reference probability that a word identified by one feature pattern appears or is referenced in a subsequent sentence unit is performed to obtain regression coefficients corresponding to the feature patterns;
for each sorted sentence unit,
a step of calculating, for each word extracted from the preceding sentence units up to that sentence unit in the document data, the reference probability of the word using the regression coefficient corresponding to the feature pattern identified in that sentence unit, and
a step of storing in advance, in association with the sentence unit, a weighted word group to which the calculated reference probabilities are respectively assigned;
a step of, when words are accepted, storing the words in the order in which they are accepted;
when words are accepted,
a step of extracting words that appear in the accepted words or words that reference previously accepted words,
a step of identifying the features of each extracted word in the accepted words,
a step of identifying a feature pattern including a pattern of the combination of the features when the word appears in previously accepted words, or a pattern of reference when the word is referenced from previously accepted words,
a step of calculating the reference probability of the word using the regression coefficient corresponding to the identified feature pattern, and
a step of associating with the accepted words a weighted word group to which the calculated reference probabilities are respectively assigned;
a step of calculating the difference between the reference probabilities assigned to each identical word in the weighted word groups associated with the accepted words and with the previously sorted sentence units;
a step of assigning priorities to the previously sorted sentence units in ascending order of the difference in reference probability; and
a step of outputting the sentence units on the basis of the assigned priorities.
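The retrieval side of claim 16 can be compressed into one sketch: both the accepted words and the stored sentence units carry word groups weighted by reference probability, and units are ranked by the summed per-word probability difference. Treating a word missing from one side as probability 0 is an assumption of this sketch:

```python
# Hypothetical sketch of the ranking step of claim 16.
def rank_units(query_probs: dict, stored_units: list) -> list:
    def diff(unit_probs: dict) -> float:
        words = set(query_probs) | set(unit_probs)
        return sum(abs(query_probs.get(w, 0.0) - unit_probs.get(w, 0.0))
                   for w in words)
    return sorted(stored_units, key=lambda u: diff(u[1]))  # ascending difference

stored = [("unit A", {"economy": 0.7, "president": 0.3}),
          ("unit B", {"sports": 0.8})]
print(rank_units({"economy": 0.6, "president": 0.4}, stored)[0][0])  # "unit A"
```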
[17] A sentence unit search device comprising means for acquiring document data from a document set in which a plurality of document data composed of natural language are stored, and means for accepting words, the device searching the document set on the basis of the accepted words and comprising:
means for sorting the acquired document data into sentence units each consisting of one or more sentences;
means for storing, in association with each of the successive sentence units in the acquired document data, a weighted word group consisting of the plurality of words each assigned a weight value for that sentence unit;
means for storing accepted words in the order in which they are accepted;
means for associating, each time new words are accepted, a weighted word group consisting of the plurality of words each assigned a weight value for those words;
means for extracting, from the previously sorted sentence units, sentence units recorded in association with a weighted word group similar to the weighted word group associated with the accepted words; and
means for outputting the extracted sentence units.
[18] A computer program capable of causing a computer, which is able to acquire document data from a document set in which a plurality of document data composed of natural language are stored, to function as means for accepting words and means for searching the document set on the basis of the accepted words, the program causing the computer to function as:
means for sorting the acquired document data into sentence units each consisting of one or more sentences;
means for storing, in association with each of the successive sentence units in the acquired document data, a weighted word group consisting of the plurality of words each assigned a weight value for that sentence unit;
means for storing accepted words in the order in which they are accepted;
means for associating, each time new words are accepted, a weighted word group consisting of the plurality of words each assigned a weight value for those words; and
means for extracting, from the previously sorted sentence units, sentence units recorded in association with a weighted word group similar to the weighted word group associated with the accepted words.
[19] A computer-readable recording medium on which the computer program according to claim 18 is recorded.
[20] A document storage device comprising means for storing a plurality of document data composed of natural language and means for sorting the stored document data, in order from the beginning of the document data, into sentence units each consisting of one or more sentences, wherein, for each sorted sentence unit, words that appear in the sentence unit or words that are referenced from preceding sentence units are extracted, and the extracted words are stored for each sorted sentence unit, the device comprising:
means for storing, in association with each of the successive sentence units in the document data, a weighted word group consisting of the plurality of words each assigned a weight value for that sentence unit.
[21] The document storage device according to claim 20, comprising:
extraction means for extracting, for one word among the extracted words, from the weighted word groups associated with the respective sentence units, word groups that contain the one word and in which the weight value of the one word is equal to or greater than a predetermined value;
creation means for creating a related word group in which a value obtained by integrating, word by word, the weight values of the words in the word groups extracted by the extraction means is assigned as the degree of relevance of the one word to each word; and
storage means for storing the created related word group in association with the one word,
wherein the processes of the extraction means, the creation means, and the storage means are executed for each of the extracted words, and the respective related word groups are stored in association with the words.
PCT/JP2007/055448 2006-08-21 2007-03-16 Sentence search method, sentence search engine, computer program, recording medium, and document storage WO2008023470A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2008530812A JP5167546B2 (en) 2006-08-21 2007-03-16 Sentence search method, sentence search device, computer program, recording medium, and document storage device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2006-224563 2006-08-21
JP2006224563 2006-08-21

Publications (1)

Publication Number Publication Date
WO2008023470A1 true WO2008023470A1 (en) 2008-02-28

Family

ID=39106564

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2007/055448 WO2008023470A1 (en) 2006-08-21 2007-03-16 Sentence search method, sentence search engine, computer program, recording medium, and document storage

Country Status (2)

Country Link
JP (1) JP5167546B2 (en)
WO (1) WO2008023470A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287291B (en) * 2019-07-03 2021-11-02 桂林电子科技大学 Unsupervised method for analyzing running questions of English short sentences

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06162092A (en) * 1992-11-18 1994-06-10 Fujitsu Ltd Information retrieval device
JP2004234175A (en) * 2003-01-29 2004-08-19 Matsushita Electric Ind Co Ltd Contents retrieval device and program therefor
JP2005250762A (en) * 2004-03-03 2005-09-15 Mitsubishi Electric Corp Dictionary generation device, dictionary generation method and dictionary generation program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Tokunaga, T.: "Gengo to Keisan 5: Joho Kensaku to Gengo Shori", 1st ed., University of Tokyo Press, 1999, XP003021201 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009282936A (en) * 2008-05-26 2009-12-03 Nippon Telegr & Teleph Corp <Ntt> Selection type information presentation device and selection type information presentation processing program
JP2015506509A (en) * 2011-12-28 2015-03-02 ▲騰▼▲訊▼科技(深▲セン▼)有限公司 Method and system for generating evaluation information and computer storage medium
JP2013140500A (en) * 2012-01-05 2013-07-18 Nippon Telegr & Teleph Corp <Ntt> Word extraction device, method, and program
JP2013140499A (en) * 2012-01-05 2013-07-18 Nippon Telegr & Teleph Corp <Ntt> Method, apparatus, and program for extracting word
US10614065B2 (en) 2016-10-26 2020-04-07 Toyota Mapmaster Incorporated Controlling search execution time for voice input facility searching
CN108710613A (en) * 2018-05-22 2018-10-26 平安科技(深圳)有限公司 Acquisition methods, terminal device and the medium of text similarity
JP2020042771A (en) * 2018-09-07 2020-03-19 台達電子工業股▲ふん▼有限公司Delta Electronics,Inc. Data analysis method and data analysis system
US11409804B2 (en) 2018-09-07 2022-08-09 Delta Electronics, Inc. Data analysis method and data analysis system thereof for searching learning sections
JP2020057105A (en) * 2018-09-28 2020-04-09 株式会社リコー Language processing method, language processing program and language processing device
JP7147439B2 (en) 2018-09-28 2022-10-05 株式会社リコー Language processing method, language processing program and language processing device
US11397776B2 (en) 2019-01-31 2022-07-26 At&T Intellectual Property I, L.P. Systems and methods for automated information retrieval
JP7055764B2 (en) 2019-03-13 2022-04-18 株式会社東芝 Dialogue control system, dialogue control method and program
JP2020149369A (en) * 2019-03-13 2020-09-17 株式会社東芝 Dialog control system, dialog control method, and program
CN110083681A (en) * 2019-04-12 2019-08-02 中国平安财产保险股份有限公司 Searching method, device and terminal based on data analysis
CN110083681B (en) * 2019-04-12 2024-02-09 中国平安财产保险股份有限公司 Searching method, device and terminal based on data analysis
CN111753498A (en) * 2020-08-10 2020-10-09 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN111753498B (en) * 2020-08-10 2024-01-26 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN112784577A (en) * 2021-01-26 2021-05-11 鲁巧巧 Sentence association learning system for English teaching
CN112784577B (en) * 2021-01-26 2022-11-18 鲁巧巧 Sentence association learning system for English teaching
CN113761157A (en) * 2021-05-28 2021-12-07 腾讯科技(深圳)有限公司 Response statement generation method and device

Also Published As

Publication number Publication date
JP5167546B2 (en) 2013-03-21
JPWO2008023470A1 (en) 2010-01-07

Similar Documents

Publication Publication Date Title
JP5167546B2 (en) Sentence search method, sentence search device, computer program, recording medium, and document storage device
US9330661B2 (en) Accuracy improvement of spoken queries transcription using co-occurrence information
KR101279707B1 (en) Definition extraction
US20040148170A1 (en) Statistical classifiers for spoken language understanding and command/control scenarios
US20040073874A1 (en) Device for retrieving data from a knowledge-based text
US20040148154A1 (en) System for using statistical classifiers for spoken language understanding
US20070198511A1 (en) Method, medium, and system retrieving a media file based on extracted partial keyword
EP2348427B1 (en) Speech retrieval apparatus and speech retrieval method
Favre et al. Robust named entity extraction from large spoken archives
AU2006317628A1 (en) Word recognition using ontologies
WO1998044484A1 (en) Text normalization using a context-free grammar
JP2004133880A (en) Method for constructing dynamic vocabulary for speech recognizer used in database for indexed document
EP1331574B1 (en) Named entity interface for multiple client application programs
Sen et al. Bangla natural language processing: A comprehensive analysis of classical, machine learning, and deep learning-based methods
US20230069935A1 (en) Dialog system answering method based on sentence paraphrase recognition
Kaushik et al. Automatic audio sentiment extraction using keyword spotting.
CN115759071A (en) Government affair sensitive information identification system and method based on big data
Dyriv et al. The user's psychological state identification based on Big Data analysis for person's electronic diary
Lin et al. Enhanced BERT-based ranking models for spoken document retrieval
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature
Rosset et al. The LIMSI participation in the QAst track
Masumura et al. Training a language model using webdata for large vocabulary Japanese spontaneous speech recognition
Sen et al. Bangla natural language processing: A comprehensive review of classical machine learning and deep learning based methods
Sen et al. Audio indexing
JP2008204133A (en) Answer search apparatus and computer program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 07738893; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2008530812; Country of ref document: JP)
NENP Non-entry into the national phase (Ref country code: DE)
NENP Non-entry into the national phase (Ref country code: RU)
122 Ep: pct application non-entry in european phase (Ref document number: 07738893; Country of ref document: EP; Kind code of ref document: A1)