WO2008023470A1 - Sentence search method, sentence search engine, computer program, recording medium, and document storage - Google Patents

Sentence search method, sentence search engine, computer program, recording medium, and document storage

Info

Publication number
WO2008023470A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
sentence
sentence unit
words
weighted
Prior art date
Application number
PCT/JP2007/055448
Other languages
French (fr)
Japanese (ja)
Inventor
Shun Shiramatsu
Kazunori Komatani
Hiroshi Okuno
Original Assignee
Kyoto University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kyoto University filed Critical Kyoto University
Priority to JP2008530812A priority Critical patent/JP5167546B2/en
Publication of WO2008023470A1 publication Critical patent/WO2008023470A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Definitions

  • Sentence unit search method, sentence unit search device, computer program, recording medium, and document storage device
  • The present invention relates to a search method that stores a large number of document data and retrieves content from the document set based on words, such as text or speech, received from a user for searching.
  • More particularly, it relates to a method that can directly retrieve, from among sentence units that form groups of meaning in a document and whose meaning changes dynamically with the flow of context, those sentence units whose meaning is similar to the accepted words.
  • Specifically, the present invention relates to a sentence unit retrieval method, a sentence unit retrieval apparatus, a computer program that causes a computer to function as the sentence unit retrieval apparatus, a computer-readable recording medium on which the computer program is recorded, and a document storage apparatus.
  • Conventional document search services operate as follows. Documents published on the Internet are automatically collected and stored, and for each document the words appearing in it are stored together with their appearance probabilities. When words such as keywords or a sentence are accepted, documents are extracted from the stored document set with priorities assigned in descending order of the appearance probabilities of the words included in the accepted keywords or sentence, and the sentences or paragraphs containing those words are extracted from the retrieved documents and output.
  • A user of such a document search service must think up keywords related to the information he or she wants to find.
  • Suppose, for example, that the user wants to know about economic and international policies. Even if the user's input is in natural language, which of the words "America, President, other countries, economy, problems, outbreak, countermeasures" is most important is something a human can grasp when reading, but it is difficult to express quantitatively as information handled by a device or computer. Consequently, although all the keywords are included, a document describing "economic problems of America and countermeasures by presidents of other countries" may be output instead of what the user intended.
  • There are also cases where the keyword entered for the search is included in the document to be found, but does not appear frequently there even though it has an important meaning in context.
  • In particular, a topical word is often expressed with a demonstrative pronoun or a zero pronoun. A sentence or paragraph in which the keyword entered for the search appears only as a demonstrative pronoun or zero pronoun may therefore be exactly the information the user wants to obtain as a search result.
  • However, when search results are prioritized by actual appearance frequency, such passages have a low appearance frequency of the keyword input by the user, so they are excluded from the candidates during narrowing down and are not output as search results.
  • To address this, a technique has been proposed in which the words in a document are extracted and the document is annotated, by morphological analysis, with part-of-speech information for each word, dependency information between words, and information specifying anaphoric relationships with demonstrative pronouns or zero pronouns.
  • Based on the stored information, a device or computer retrieves documents, answers questions, and performs machine translation (see Non-Patent Document 1).
  • Non-Patent Document 1: Hiroshi Hashida, "Global Document Modification," Proceedings of the 11th Annual Conference of the Japanese Society for Artificial Intelligence, pp. 62-63 (1997)
  • However, the object of the user's attention in each sentence or each utterance changes dynamically according to the context or the flow of the discourse.
  • That is, the weight representing the degree of attention paid to words in conversations and texts changes dynamically. Therefore, in order to realize a service that retrieves information related to conversations and texts, it is necessary to track the dynamic changes in word weights according to the context.
  • The technique of Non-Patent Document 1 automatically analyzes information that can be identified grammatically, such as part-of-speech information, and can add to the document information on anaphora, coreference, and dependency for demonstrative pronouns or zero pronouns. With this added information, the noun being referred to can be counted toward appearance frequency, so the relationships between words in sentences or paragraphs can be analyzed. However, the degree of attention paid to words in each sentence or paragraph, i.e., their manifestation, cannot be measured quantitatively.
  • The technique of Non-Patent Document 1 can be applied to realizing question answering in which a computer responds to a question posed in a natural sentence, taking into account words omitted from the question.
  • However, it is not easy to calculate the contextual meaning of a conversation between multiple users as a quantitative value, nor to generate and present, as a third party, utterances suited to the users' conversational context.
  • The present invention has been made in view of such circumstances. For each sentence unit consisting of one or more sentences, a weighted word group, in which each word is assigned a weight value indicating its manifestation in that sentence unit, is stored in association with the sentence unit. The words accepted for search are likewise associated with a weighted word group whose weight values reflect manifestation in those words, and sentence units whose weighted word groups are similar are extracted and output.
  • It is an object of the present invention to provide a sentence unit search method, a sentence unit search apparatus, a computer program that causes a computer to function as the search apparatus, and a computer-readable recording medium on which the computer program is recorded, which automatically generate, from the received words, information reflecting the context of preceding words in the user's consciousness, and which can directly search, among sentence units in a document whose meaning changes dynamically with the flow of context, for sentence units whose contextual meaning is similar to that represented by the generated information.
  • A further object of the present invention is to give, as the weight value indicating the manifestation of each word in the weighted word group associated with a sentence unit or with received words, the probability that the word will appear or be referred to in subsequent sentence units or words.
  • A further object of the present invention is to quantitatively calculate degrees of association between related words and to reflect those degrees of association in the manifestation of each word in each sentence unit or word, so that sentence units which the user is reminded of when speaking or writing can be retrieved effectively even if the corresponding words do not appear in the words spoken or the text written.
  • An object of the present invention is to provide a sentence unit retrieval method and a document storage device capable of such retrieval. Means for solving the problem
  • The sentence unit retrieval method according to the present invention uses a document set in which a plurality of document data composed of natural language are stored, separates the document data obtained from the document set into sentence units each consisting of one or more sentences, accepts words, and retrieves the separated sentence units from the document set based on the accepted words.
  • In this method, the step of extracting similar sentence units includes a step of determining whether the distribution of weight values of the plurality of words in the weighted word group associated with each previously sorted sentence unit satisfies a predetermined condition relative to the distribution of weight values of the plurality of words in the weighted word group associated with the received words, and a step of extracting the sentence units associated with the weighted word groups determined to satisfy the predetermined condition.
  • Alternatively, the step of extracting similar sentence units includes a step of extracting, from the previously sorted sentence units, those whose weighted word group contains the same words as the weighted word group associated with the received words, a step of calculating the difference between the weight values assigned to those same words, and a step of assigning priorities to the extracted sentence units in ascending order of the calculated difference; the extracted sentence units are output based on the priorities.
  • In another aspect, the method includes a step of calculating each weighted word group as a multidimensional vector in which each word constitutes one dimension and the magnitude of the weight value assigned to each word is the element in the dimension corresponding to that word. The step of extracting similar sentence units then includes a step of calculating the distance between the multidimensional vector stored for each separated sentence unit and the multidimensional vector associated with the received words, and a step of assigning priorities to the sentence units in ascending order of the calculated distance; the sentence units are output according to the assigned priorities.
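  • As an illustration, the multidimensional vector representation and distance-based ranking just described can be sketched as follows; this is a minimal sketch in Python, not the patent's implementation, and the example words and weight values are invented.

```python
import math

# A weighted word group is modeled as a dict mapping word -> weight value
# (manifestation); each sentence unit and the accepted words get one vector.

def distance(a: dict[str, float], b: dict[str, float]) -> float:
    """Euclidean distance between two weighted word groups, treating each
    word as one dimension of a multidimensional vector."""
    words = set(a) | set(b)
    return math.sqrt(sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in words))

def rank_sentence_units(query: dict[str, float],
                        stored: list[tuple[str, dict[str, float]]]) -> list[str]:
    """Return stored sentence units in ascending order of distance, i.e.,
    highest priority (most similar contextual meaning) first."""
    return [unit for unit, _ in sorted(stored, key=lambda s: distance(query, s[1]))]

query = {"Kyoto": 0.8, "summer": 0.6, "festival": 0.7}
stored = [
    ("S_k: the Gion Festival unit", {"Kyoto": 0.7, "festival": 0.8, "July": 0.5}),
    ("S_1: a unit about Tokyo in winter", {"Tokyo": 0.9, "winter": 0.6}),
]
print(rank_sentence_units(query, stored))   # S_k ranks first
```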
  • In another aspect, the method includes a reference probability calculation step of calculating, for each word, a reference probability that the word will appear in, or be referred to from, sentence units or words subsequent to the sentence unit or words in question; the calculated reference probability is assigned as the weight value of each word.
  • The reference probability calculation involves: a specifying step of specifying, for each word, a feature pattern that includes the pattern of the word's appearances across a plurality of sentence units including preceding sentence units, or the pattern of its being referred to from preceding sentence units or words; a determination step of determining, for each word for which a feature pattern has been specified, whether a word with the same feature pattern appeared or was referred to in subsequent sentence units in the document data; and a regression step of calculating regression coefficients of the feature patterns with respect to the reference probability by performing regression analysis on the specified feature patterns and the determination results. When a weighted word group is stored in association with each sentence unit, or associated with accepted words, the reference probability calculation step specifies the feature pattern of each word in that sentence unit or those words and calculates the reference probability from the identified feature pattern using the regression coefficients.
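  • A minimal sketch of this regression step follows. The patent specifies only "regression analysis"; logistic regression, the scikit-learn API, and the tiny feature set and training data here are assumptions for illustration.

```python
# Learn, from feature patterns observed in a training document set, the
# probability that a word will appear or be referred to in the subsequent
# sentence unit.
from sklearn.linear_model import LogisticRegression

# Each row is one (word, sentence unit) observation: quantified features such
# as distance since last mention, mention count, was-subject flag, POS-is-noun.
X_train = [
    [1, 3, 1, 1],   # mentioned 1 unit ago, 3 mentions, was subject, noun
    [5, 1, 0, 1],   # mentioned 5 units ago, 1 mention, not subject, noun
    [2, 2, 1, 0],
    [8, 1, 0, 0],
]
# Label: did the word actually appear / get referred to in the next unit?
y_train = [1, 0, 1, 0]

model = LogisticRegression().fit(X_train, y_train)

# At search time: the same features, extracted for a word in the current
# sentence unit or utterance, yield its reference probability (weight value).
features_now = [[1, 4, 1, 1]]
reference_probability = model.predict_proba(features_now)[0][1]
print(f"weight value (reference probability): {reference_probability:.3f}")
```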
  • In another aspect, the reference probabilities for the acquired document data are calculated from a first document set composed of written language, and the reference probabilities for the accepted words are calculated from a second document set composed of spoken language.
  • That is, the sentence unit search method performs the specifying step, the determination step, and the regression step for each of the first document set made up of written language and the second document set made up of spoken language,
  • and in the reference probability calculation step, for the feature pattern of a word specified in a sentence unit, the reference probability is calculated using the regression coefficients obtained in the regression step executed on the first document set, while for the feature pattern of a word specified in the accepted words, the reference probability is calculated using the regression coefficients obtained in the regression step executed on the second document set.
  • The feature pattern is specified from information including one or more of the following: when the word is referred to from a preceding sentence unit or preceding words, the number of sentence units or words from that preceding sentence unit or words up to the sentence unit or words containing the word; the dependency information of the word in the last preceding sentence unit or words in which it appeared or was referred to; the number of times the word has appeared or been referred to up to the sentence unit or words containing it; the noun-type distinction of the word in the last preceding sentence unit or words in which it appeared or was referred to; whether the word was the subject in that last preceding sentence unit or words; the grammatical person of the word; and the part-of-speech information of the word in the sentence unit or words containing it.
  • When the words are accepted as speech, the feature pattern is specified from information including one or more of the following: the time elapsed from the preceding sentence unit or words in which the word appeared or was referred to; the utterance speed corresponding to the word in the last preceding sentence unit or words in which it appeared or was referred to; and the voice frequency corresponding to the word in that last preceding sentence unit or words.
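  • One way to quantify such a feature pattern is a simple record per word and sentence unit, as sketched below; every field name is illustrative, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class FeaturePattern:
    units_since_last_mention: int   # sentence units/words back to last appearance or reference
    mention_count: int              # times appeared or referred to so far
    was_subject: int                # 1 if the word was the subject at its last mention
    is_proper_noun: int             # noun-type distinction
    person: int                     # grammatical person (1, 2, 3; 0 if n/a)
    pos_is_noun: int                # part-of-speech information, collapsed to a flag
    # Speech-only features (0.0 for written text):
    seconds_since_last_mention: float
    utterance_speed: float          # e.g., phonemes per second for the word
    voice_frequency: float          # e.g., mean pitch in Hz

    def as_vector(self) -> list[float]:
        """Flatten to the explanatory-variable vector used by the regression."""
        return [self.units_since_last_mention, self.mention_count,
                self.was_subject, self.is_proper_noun, self.person,
                self.pos_is_noun, self.seconds_since_last_mention,
                self.utterance_speed, self.voice_frequency]
```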
  • The sentence unit search method further includes, for one word among the words extracted from the document set: a first step of extracting, from the weighted word groups associated with the sorted sentence units, the word groups that include the one word and in which the weight value of the one word is greater than or equal to a predetermined value; a second step of creating a related word group in which the value obtained by integrating, word by word, the weight values of the word groups extracted in the first step is assigned to each word as the degree of relevance of the one word to that word; and a third step of storing the created related word group in association with the one word. The first to third steps are executed for each of the extracted words.
  • The method then includes a relevance addition step of re-assigning the weight value of each word of the weighted word group associated with each sentence unit or with the accepted words, using the relevance degrees of the related word group stored in association with each word.
  • The second step includes: a step of calculating, for each word included in the extracted word groups, the sum of that word's weight values weighted by the weight value of the one word in each group; a step of averaging the calculated sums; and a step of assigning the averaged sum of weight values of each word as the relevance degree of that word in the related word group being created.
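  • A minimal sketch of the first and second steps follows; normalizing by the accumulated weight of the one word is one reading of the integrating and averaging steps, and the words and weights are invented.

```python
from collections import defaultdict

def related_word_group(target: str,
                       weighted_groups: list[dict[str, float]],
                       threshold: float = 0.5) -> dict[str, float]:
    """Create the related word group of `target`: average, over the weighted
    word groups where `target` weighs at least `threshold`, each co-occurring
    word's weight, weighted by `target`'s own weight in that group."""
    selected = [g for g in weighted_groups if g.get(target, 0.0) >= threshold]
    if not selected:
        return {}
    sums: dict[str, float] = defaultdict(float)
    total = 0.0
    for g in selected:
        w_target = g[target]
        total += w_target
        for word, w in g.items():
            sums[word] += w_target * w     # weighted by the one word's weight
    return {word: s / total for word, s in sums.items()}   # averaging step

groups = [
    {"Kyoto": 0.8, "festival": 0.6, "temple": 0.4},
    {"Kyoto": 0.6, "festival": 0.5, "July": 0.3},
    {"Kyoto": 0.1, "Tokyo": 0.9},   # below threshold: ignored
]
print(related_word_group("Kyoto", groups))
# "festival" ends up most relevant to "Kyoto" after Kyoto itself
```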
  • The relevance addition step includes: a step of multiplying, for each word of the weighted word group associated with each sentence unit or with the accepted words, the relevance degrees of the words in the related word group stored in association with that word by the weight value of that word in the weighted word group; and a step of re-assigning the weight values of the words of the weighted word group based on the multiplication results.
  • The related word group of each word is calculated as a multidimensional relevance vector having each word as one dimension and the relevance degree given to each word as the element in the corresponding dimension, and the relevance addition step converts the multidimensional vector stored for each sorted sentence unit by a matrix whose columns are the relevance vectors of the respective words.
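  • A minimal sketch of this conversion follows, using NumPy; the vocabulary, the relevance matrix, and the choice of 1.0 for each word's relevance to itself are invented for illustration.

```python
import numpy as np

vocab = ["Kyoto", "festival", "July"]

# Column j of R is the relevance vector of vocab[j]; each word's relevance to
# itself is taken as 1.0 here (an assumption).
R = np.array([
    [1.0, 0.7, 0.4],
    [0.7, 1.0, 0.5],
    [0.4, 0.5, 1.0],
])

# Weighted word group of one utterance: "festival" itself never appeared.
v = np.array([0.9, 0.0, 0.2])

v_converted = R @ v
print(dict(zip(vocab, v_converted.round(2).tolist())))
# {'Kyoto': 0.98, 'festival': 0.73, 'July': 0.56}
# "festival" now carries weight via its relevance to "Kyoto", so sentence
# units about festivals move closer to this utterance's vector.
```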
  • In another aspect, the sentence unit search method uses a document set in which a plurality of document data consisting of natural language are stored, accepts words, and searches the document set based on the accepted words. As preprocessing, it includes the steps of: separating the document data obtained from the document set into sentence units each consisting of one or more sentences; extracting, for each sentence unit, the words that appear in it or that are referred to from preceding sentence units in the document data; specifying and storing, for each extracted word, its features in each sentence unit; specifying, for each word extracted for a sentence unit, a feature pattern including the pattern of the combination of its features when it appeared in the sentence unit and preceding sentence units, or the pattern of its being referred to from preceding sentence units; storing each specified feature pattern together with whether the word specified by it appeared or was referred to in a subsequent sentence unit; executing regression learning over all sentence units of the documents obtained from the document set, by regression analysis of the reference probability that a word specified by a given feature pattern appears or is referred to in the subsequent sentence unit, so as to obtain the regression coefficients corresponding to the feature patterns; calculating, for each sentence unit and for each word extracted up to that sentence unit in the document data, the reference probability of the word using the regression coefficients corresponding to the feature pattern identified in that sentence unit; and storing in advance the weighted word groups to which the calculated reference probabilities are assigned.
  • At search time, it includes the steps of: when words are accepted, storing them in the order of acceptance; extracting the words that appear in the accepted words or that are referred to from previously accepted words; identifying the features of those words in the accepted words; specifying a feature pattern including the pattern of the combination of features when a word appeared in previously accepted words, or the pattern of its being referred to from them; calculating the reference probability of each word using the regression coefficients corresponding to the specified feature pattern; and associating with the accepted words a weighted word group to which the calculated reference probabilities are assigned.
  • Finally, it includes the steps of: calculating, for each word common to the weighted word group of the accepted words and the weighted word group of each previously sorted sentence unit, the difference between the assigned reference probabilities; assigning priorities to the previously sorted sentence units in ascending order of that difference; and outputting the sentence units based on the assigned priorities.
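  • As an illustration of this difference-based ranking, here is a minimal sketch; the unit names, words, and probabilities are invented, and treating units with no shared words as lowest priority is an assumption the claim leaves open.

```python
def probability_difference(query: dict[str, float],
                           stored: dict[str, float]) -> float:
    """Sum of absolute differences of reference probabilities over shared words;
    smaller means more similar contextual meaning."""
    shared = set(query) & set(stored)
    if not shared:
        return float("inf")     # no shared words: lowest priority (assumption)
    return sum(abs(query[w] - stored[w]) for w in shared)

units = [
    ("S_a", {"Kyoto": 0.7, "festival": 0.8}),
    ("S_b", {"Kyoto": 0.2, "festival": 0.1}),
]
query = {"Kyoto": 0.8, "festival": 0.7}
ranked = sorted(units, key=lambda u: probability_difference(query, u[1]))
print([name for name, _ in ranked])   # ['S_a', 'S_b']
```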
  • The sentence unit search device according to the present invention comprises: means for acquiring document data from a document set in which a plurality of document data consisting of natural language are stored; means for receiving words; means for separating the acquired document data into sentence units each consisting of one or more sentences; means for storing, in association with each separated sentence unit, a weighted word group composed of a plurality of words each assigned a weight value for that sentence unit; means for storing received words in the order received; means for associating with newly received words a weighted word group composed of a plurality of words each assigned a weight value in those words; means for extracting, from the previously sorted sentence units, sentence units recorded in association with weighted word groups similar to the weighted word group associated with the received words; and means for outputting the extracted sentence units.
  • The computer program according to the present invention causes a computer, which is capable of acquiring document data from a document set in which a plurality of document data composed of natural language are stored and which comprises means for receiving words, to function as a device that searches the document set based on received words. It causes the computer to function as: means for separating the acquired document data into sentence units of one or more sentences; means for storing, in association with each sentence unit, a weighted word group composed of a plurality of words each assigned a weight value for that sentence unit; means for storing received words in the order received; means for associating with newly received words a weighted word group composed of a plurality of words each assigned a weight value in those words; and means for extracting sentence units recorded in association with weighted word groups similar to the weighted word group associated with the received words.
  • The computer-readable recording medium according to the present invention is characterized in that the computer program described above is recorded thereon.
  • The document storage device according to the present invention comprises: means for storing a plurality of document data composed of natural language; means for dividing the stored document data, in order from the top, into sentence units composed of one or more sentences, extracting for each sentence unit the words that appear in it or are referred to from preceding sentence units, and storing the extracted words for each sentence unit;
  • means for storing, in association with each sentence unit in the document data, a weighted word group composed of a plurality of words each assigned a weight value for that sentence unit;
  • extraction means for extracting, from the weighted word groups associated with the sentence units, those that include one given word;
  • creation means for creating a related word group in which a degree of relevance of the one word to each word is assigned, and storage means for storing the created related word group in association with the one word. The processing of the extraction means, the creation means, and the storage means is executed for each extracted word, and each related word group is stored in association with the corresponding word.
  • In the present invention, document data is acquired from a document set in which document data composed of natural language is recorded, and the acquired document data is divided into sentence units of one or more sentences.
  • For each sentence unit, each word appearing in the document set is given a weight value in that sentence unit, and the weighted word group of words so weighted is stored in association with the sentence unit.
  • A weighted word group, in which each word is assigned a weight value with respect to the accepted words, is likewise associated with the accepted words.
  • A sentence unit associated with a weighted word group similar to the weighted word group of the accepted words is then extracted from the previously sorted sentence units and output.
  • When extracting the sentence units associated with similar weighted word groups, whether the distribution of weight values of the words in each weighted word group stored in advance in association with a sentence unit is similar to the distribution of weight values of the words in the weighted word group associated with the received words is determined by whether a predetermined condition is satisfied, and the sentence units associated with the weighted word groups determined to be similar are extracted.
  • The weighted word group is obtained as a multidimensional vector having each word as one dimension and the weight value given to each word as the element of the corresponding dimension. Whether weighted word groups are similar is determined by whether the distance between them, that is, the distance between the multidimensional vectors, is short. The extracted sentence units are output in order of increasing distance between the multidimensional vectors, that is, in order of similarity of the weighted word groups.
  • The reference probability is calculated as the rate at which words having the same feature pattern as the one specified for each word, including the pattern of appearances up to each sentence unit or of references from preceding sentence units, appear or are referred to in subsequent sentence units in the document set.
  • For each word extracted from the document set, regression analysis is performed on the specified feature patterns and on the determination of whether the word appeared or was referred to in subsequent sentence units of the documents, and regression coefficients of the feature patterns with respect to the reference probability are calculated.
  • The reference probability of each word is then calculated from its specified feature pattern and the regression coefficients.
  • The document set is divided into a first document set made up of written language and a second document set made up of spoken language.
  • The reference probability given to each word in the weighted word group associated with a sentence unit is calculated based on the first document set, and the reference probability given to each word in the weighted word group associated with the accepted words is calculated based on the second document set.
  • For specifying the feature pattern of each word, information such as the following is handled quantitatively: the number of sentence units or words from the preceding sentence unit or words in which the word appeared or was referred to, up to the current sentence unit or words; the dependency information of the word; the number of its appearances or references; the noun-type distinction of the word; whether the word was the subject; the grammatical person of the word; and the part-of-speech information of the word.
  • When the reference probability is calculated for speech, the features for specifying the feature pattern of each word are likewise handled quantitatively: the time elapsed since the preceding sentence unit or words in which the word appeared or was referred to, the speech rate corresponding to the word when it appeared or was referred to, and the pitch (voice frequency) information.
  • For one word extracted from the document set, the weighted word groups in which that word's weight value is not less than a predetermined value are extracted.
  • One related word group is created by integrating, word by word, the weight values of the plurality of weighted word groups extracted for the one word. The relevance degree of each word in the created related word group represents how deeply that word relates to the one word when the one word is given a weight value greater than the predetermined value.
  • A related word group is generated and stored in this way for each word extracted from the document set. The weight value of each word of the weighted word group associated with each sentence unit or with the accepted words is then reassigned using the relevance degrees of the related word group stored for each word.
  • For the word groups extracted as weighted word groups in which the one word's weight value is greater than or equal to the predetermined value, the sum of each word's weight values, weighted by the weight value of the one word in each group, is calculated. The sums are averaged, and the averaged sum of weight values of each word is assigned as the relevance degree of that word in the related word group.
  • The relevance degrees of the related word groups stored for the respective words are multiplied by the weight values of the words of the weighted word group associated with each sentence unit or with the accepted words, and the multiplication results are reassigned as the weight values of the words of the weighted word group. Focusing on one word of the weighted word group, the relevance degrees of the related word group corresponding to that one word are used: the weight value of each word other than the one word is multiplied by its relevance degree in the related word group associated with the one word, so that the influence exerted on the one word's weight value by the weight values of highly associated words is taken into account.
  • The related word group of each word is obtained as a multidimensional relevance vector having each word as one dimension
  • and the relevance degree given to each word as the element of the corresponding dimension.
  • The multidimensional vector associated with each sentence unit or with the accepted words is converted by a matrix whose columns are the relevance vectors of the respective words.
  • The converted multidimensional vector is thus represented in an oblique coordinate system in which the axes of words with high mutual relevance lie closer together, shortening the distance between such words.
  • A multidimensional vector representing a weighted word group is rotated toward the axes of words highly related to the words it contains, and the distance between multidimensional vectors containing highly related words becomes shorter.
  • In the present invention, the words that appear in each sentence unit or that are referred to from preceding sentence units are extracted.
  • For each word, a feature pattern is identified that includes the pattern of the combination of its features leading up to each sentence unit, or the pattern of its references from preceding sentence units.
  • The reference probabilities of the extracted words are calculated and stored in advance as a weighted word group for each sentence unit.
  • For the accepted words, a feature pattern based on the preceding words is likewise specified, the reference probability of each word is calculated, and a weighted word group is associated with them.
  • The pre-stored sentence units are output with priorities assigned in ascending order of the difference in reference probabilities for the words they share with the weighted word group of the accepted words.
  • In the present invention, a weighted word group, in which each of a plurality of words is assigned a weight value for the sentence unit, is stored in association with
  • each sentence unit of one or more sentences in the acquired document data.
  • The weighted word group is the set of weight values of the words in each sentence unit and can be regarded as information indicating the group of meanings of that sentence unit.
  • The weighted word group of each separated sentence unit is a group of meanings within the whole document.
  • It can be understood as a group of meanings that changes dynamically, in time series, with the flow of context following the preceding sentences of the document.
  • Whether weighted word groups are similar is determined from the distributions of the weight values of the words in the weighted word group of the accepted words and in each weighted word group stored in advance.
  • When the distributions resemble each other, the stored weighted word group and the weighted word group of the received words can be said to be similar.
  • The predetermined condition for judging that weighted word groups are similar is therefore a condition that the distributions of the weight values of the words are similar.
  • For example, when the ratio of one word's weight value to another word's weight value in one weighted word group approximates the corresponding ratio in the other weighted word group, the weighted word groups can be determined to be similar to each other.
  • The predetermined condition can also be set as whether the weight value of each word is equal to or greater than a predetermined value.
  • It is also possible to determine similarity by whether the difference between the weight values of the same word is small.
  • By calculating the weighted word group as a multidimensional vector having each word as one dimension and the weight value of that word in the sentence unit or words as the element of the corresponding dimension component,
  • the group of meanings of each sentence unit or of the words can be treated as a quantitative vector.
  • Treating the group of meanings of each sentence unit or of the words as a quantitative multidimensional vector, a computer capable of vector calculation can directly extract similar sentence units by calculating the distance between the vector associated with the accepted words and the vector stored for each sentence unit.
  • Alternatively, a condition may be set on which region of the multidimensional space the multidimensional vector of the accepted words or of the previously sorted sentence units falls into, and similar sentence units can be extracted directly in that way.
  • The document set is not limited to a set of document data consisting of so-called written language. The separated sentence units are therefore not necessarily sentence units of written language.
  • Document data means data that has already been stored and is distinguished from words that are received in real time, and may be document data in which spoken dialogues are written in order.
  • The accepted words are not limited to keywords, sentences, and the like that are input for the purpose of search; they may be, for example, individual utterances, including speech, during a dialogue between users.
  • Since sentence units are extracted based on weighted word groups assigned weight values for each utterance, the group of meanings can be estimated for each utterance, taking into account that meaning changes dynamically and chronologically from utterance to utterance during a conversation.
  • Sentence units similar to the estimated group of meanings can therefore be extracted and presented for each utterance.
  • The weight value of each word of the weighted word group is given as the reference probability that the word will appear or be referred to in subsequent sentence units or words.
  • The reference probability can thus be expressed as the degree to which each word in the sentence unit is attended to, that is, its manifestation.
  • For sentence units, the reference probability is learned and calculated based on a document set consisting of written language; when the received words are spoken, the reference probability is learned and calculated based on a document set consisting of spoken language. As a result, sentence units with more similar meanings can be output, reflecting the differing characteristics of written and spoken language.
  • In the present invention, the degree of association of each word with the other words is quantitatively calculated and stored for each word.
  • The weight value of each word in the weighted word group is recalculated based on the weight values of the other words and on the degrees of association of those words with it.
  • The weight value of one word can thereby reflect the influence of the weight values of words highly associated with it: when the weight value of a word highly associated with the one word is high, the effect that the one word's weight value also becomes high can be reproduced.
  • When the related word group of one word is expressed as a relevance vector and the weighted word group is expressed as a multidimensional vector,
  • the multidimensional vector is converted with a matrix whose columns are the relevance vectors of the respective words. This shortens the distance between the multidimensional vectors representing weighted word groups that contain highly associated words.
  • In this way, the influence of the weight values of words highly relevant to one word can be reflected in that word's weight value. By reflecting the degrees of relevance in the manifestation of each word in each sentence unit or utterance, the invention has the excellent effect that sentence units which the user is reminded of can be searched effectively even if the corresponding words do not appear in the accepted words.
  • FIG. 1 is an explanatory diagram showing an outline of a sentence unit search method according to the present invention.
  • FIG. 2 is a block diagram showing a configuration of a search system using the sentence unit search device according to the first embodiment.
  • FIG. 3 is a flowchart showing the processing procedure by which the CPU of the sentence unit search device according to the first embodiment performs morphological analysis and syntactic analysis on the acquired document data, performs tagging and word extraction from the results, and stores them.
  • FIG. 4 is an explanatory diagram showing an example of the contents of document data stored in the document storage means in the first embodiment.
  • FIG. 5 is an explanatory diagram showing an example of document data that the CPU of the sentence unit search device according to the first embodiment gives the result of morphological analysis and syntactic analysis and stores in the document storage means.
  • FIG. 6 is an explanatory diagram showing an example of a list of extracted words for all document data acquired by the CPU of the sentence unit search device according to the first embodiment.
  • FIG. 7 is a flowchart showing the processing procedure by which the CPU of the sentence unit search apparatus according to Embodiment 1 extracts samples from the tagged document data stored in the document storage means and performs regression analysis to estimate the regression equation for calculating the reference probability.
  • FIG. 8 is an explanatory diagram showing an example of a feature pattern identified by a sentence in document data stored in the document storage unit in the first embodiment.
  • FIG. 9 is a flowchart showing the processing procedure by which the CPU of the sentence unit search apparatus according to the first embodiment calculates and stores word reference probabilities for each sentence of the tagged document data stored in the document storage means.
  • FIG. 10 is a flowchart showing the processing procedure by which the CPU of the sentence unit search device in Embodiment 1 calculates and stores word reference probabilities for each sentence of the tagged document data stored in the document storage means.
  • FIG. 11 is an explanatory diagram showing an example in which the CPU of the sentence unit search device in Embodiment 1 separates the document represented by the document data into sentences.
  • FIG. 14 is an explanatory diagram showing how the set of words stored for each sentence by the CPU of the sentence unit search apparatus, and the reference probabilities calculated for those words, change as the sentences continue.
  • FIG. 15 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the first embodiment.
  • FIG. 16 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the first embodiment.
  • FIG. 17 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the first embodiment.
  • FIG. 18 is an explanatory diagram showing an example of a feature pattern specified for text data that the CPU of the sentence unit search device according to the first embodiment received from the receiving device.
  • FIG. 19 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the second embodiment.
  • FIG. 20 is an explanatory diagram showing an outline of the influence of the manifestation of a word closely related to one word, related to the search method of the present invention in Embodiment 3.
  • FIG. 26 is an explanatory diagram showing an example of the content of a weight value representing the manifestation of each word calculated by the CPU of the sentence unit search device in the third embodiment.
  • FIG. 27 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the third embodiment.
  • FIG. 28 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the third embodiment.
  • FIG. 29 is a block diagram showing a configuration when the sentence unit retrieval method of the present invention is implemented by a sentence unit retrieval apparatus.
  • FIG. 1 is an explanatory diagram showing an outline of the sentence unit search method according to the present invention.
  • Reference numeral 100 in FIG. 1 denotes a document set in which a plurality of document data are stored. One document 101 obtained from the document set 100 is separated into sentence units S1, ..., Sk, ...
  • 200 in FIG. 1 represents a conversation between user A and user B.
  • The conversation 200 between user A and user B is a time-series set of utterances U1, ..., Uj made by users A and B, and the conversation proceeds in the order of those utterances.
  • The sentence unit search method according to the present invention treats the degree of attention the user pays to each word, at the time of writing or uttering the sentence unit or words, as a quantitative weight value assigned to each word,
  • uses the weighted word group, which reflects the degree of attention changing from one time-series sentence unit or utterance to the next, as an index representing the contextual meaning of each sentence unit,
  • and aims to directly search for and output sentence units having a similar contextual meaning.
  • Conversation 200 in the example shown in the explanatory diagram of FIG. 1 is a conversation about travel to Kyoto between user A and user B.
  • In utterance U1 of conversation 200, "Kyoto" and "travel" appear.
  • As the conversation continues, "Kyoto" and "time" attract more attention than "travel", and user A and user B should both be able to recognize that the contextual implications are shifting. Further on, "famous" and "festival" appear in utterance Uj. Considering utterance Uj in isolation, the words "Kyoto", "travel", "time", and "hot" do not appear in it. However, at least for user A, utterance Uj carries the meaning of a "festival" in "Kyoto" in the "summer" context. Therefore, even at the time of utterance Uj, "Kyoto" still carries weight in the contextual implications. It should also be noted that user A, who makes utterance Uj, should at least be reminded of the "Gion Festival" as the festival in question.
  • Sentence unit Sk in its context has the meaning that, for "Kyoto" in "July", it is the "Gion Festival".
  • That is, the sentence unit Sk has the meaning that in "summer", in "July", in "Kyoto", it is the "Gion Festival".
  • Utterance Uj and sentence unit Sk thus share weights on "summer", "Kyoto", and "festival", and have similar contextual implications.
  • The aim is to estimate, from the preceding utterances, the group of contextual meanings that the user is aware of at the time of utterance Uj, and thereby to directly search for and output the sentence unit Sk having a similar contextual meaning.
  • In this way, the computer system can present relevant information for each utterance and join the conversation.
  • the computer system can support the conversation between user A and user B.
  • For example, when the computer system outputs an audio message such as "In July, Kyoto has the Gion Festival" after utterance Uj by user A in conversation 200, a three-way exchange among user A, user B, and the computer system is realized.
  • Alternatively, by having the computer system present information such as "the Gion Festival, for Kyoto in July", support of the conversation between user A and user B is also realized.
  • the computer system is made to execute the sentence unit search method according to the present invention.
  • The computer device requires preprocessing, including processing to store the document data of the document set in advance, separated into sentence units, and to prepare quantitative information representing the contextual meaning of each of the separated sentence units.
  • It also requires search processing, including processing to obtain quantitative information representing the meaning of each utterance in the flow of the conversation, and processing to extract sentence units with similar meanings based on the information obtained for the utterance
  • and to output them as search results.
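  • The two phases can be sketched end to end as follows. The recency-decayed word weighting and the cosine measure are stand-ins for the patent's regression-based reference probabilities and its distance or difference measures; all documents and utterances are invented.

```python
import math
import re

def split_into_sentence_units(document: str) -> list[str]:
    # One sentence = one sentence unit, as in Embodiment 1 (split on "." or "。").
    return [s.strip() for s in re.split(r"[.。]", document) if s.strip()]

def weighted_word_group(units_so_far: list[str]) -> dict[str, float]:
    # Stand-in manifestation: recently/often mentioned words weigh more.
    weights: dict[str, float] = {}
    n = len(units_so_far)
    for i, unit in enumerate(units_so_far):
        for word in unit.lower().split():
            weights[word] = weights.get(word, 0.0) + 0.5 ** (n - 1 - i)
    return weights

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Preprocessing: index each sentence unit with its weighted word group.
documents = ["Kyoto holds the Gion Festival in July. The festival is famous.",
             "Tokyo is crowded in winter."]
index = []
for doc in documents:
    units = split_into_sentence_units(doc)
    for i, unit in enumerate(units):
        index.append((unit, weighted_word_group(units[: i + 1])))

# Search processing: weight the conversation so far, then pick the closest unit.
conversation = ["we will travel to kyoto", "it is hot in july"]
query = weighted_word_group(conversation)
best = max(index, key=lambda item: cosine(query, item[1]))
print(best[0])   # the Gion Festival unit matches the conversation context
```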
  • In Embodiments 1 to 3 described below, the hardware configuration necessary for causing a computer device to execute the sentence unit search method according to the present invention is described first, and the processing by the computer device is then explained step by step, distinguishing the preprocessing from the search processing.
  • Specifically, in each embodiment, as an example of executing the sentence unit search method according to the present invention, a search system is described that includes hardware storing a document set of document data, computer devices that accept utterances,
  • and a computer device that executes the search processing by connecting to the devices that accept utterances and to the hardware that stores the document set.
  • Each process and specific example is shown mainly for the case where the document set consists of Japanese natural sentences.
  • the sentence unit search method of the present invention can be applied not only to Japanese but also to other languages.
  • For grammatical handling specific to each language, such as language analysis (morphological analysis and syntactic analysis), the method most appropriate to each language is used.
  • FIG. 2 is a block diagram showing a configuration of a search system using the sentence unit search apparatus 1 according to the first embodiment.
  • The retrieval system consists of a sentence unit retrieval device 1 that executes the retrieval processing on document data, document storage means 2 that stores document data in natural language, a packet switching network 3 such as the Internet, and accepting devices 4, 4, ... that accept user input words such as keywords or speech.
  • The sentence unit search device 1 is a PC (personal computer), and the accepting devices 4, 4, ... are also PCs; the sentence unit retrieval device 1 is connected to the accepting devices 4, 4, ... via the packet switching network 3.
  • the sentence unit search apparatus 1 stores document data including a sentence unit to be searched in the document storage unit 2 in advance.
  • The sentence unit search device 1 classifies the document data stored in the document storage means 2 into sentence units in advance, and stores quantitative information representing contextual meaning for each sentence unit so that it can be searched later.
  • The receiving devices 4, 4, ... convert the received words into text data or voice data that can be processed by a computer, and transmit the data to the sentence unit searching device 1 via the packet switching network 3.
  • The sentence unit retrieval device 1 extracts sentence units, each consisting of one or more sentences, from the document data stored in the document storage means 2 based on the received word data, and outputs the extracted sentence units to the receiving devices 4, 4, ... via the packet switching network 3, thereby realizing a sentence-unit search.
  • The sentence unit search device 1 includes at least a CPU 11 that controls the various hardware components, an internal bus 12 connecting them, storage means 13 comprising nonvolatile memory, a temporary storage area 14 comprising volatile memory, communication means 15 for connection to the packet switching network 3, document set connection means 16 for connection to the document storage means 2, and auxiliary storage means 17 that uses a portable recording medium 18 such as a DVD or CD-ROM.
  • The storage means 13 stores a control program 1P, acquired from a portable recording medium 18 such as a DVD or CD-ROM, for the PC to operate as the sentence unit search device 1 according to the present invention.
  • The CPU 11 reads out the control program 1P from the storage means 13 and executes it, controlling the various hardware components via the internal bus 12.
  • the temporary storage area 14 stores information temporarily generated by the arithmetic processing of the CPU 11.
  • The CPU 11 detects, via the communication means 15, that word data transmitted from the accepting devices 4, 4, ... has been received, and executes search processing based on the received word data. The CPU 11 can also acquire the document data stored in the document storage means 2 via the document set connection means 16, and can store document data in the document storage means 2 via the document set connection means 16.
  • The control program 1P, obtained from the portable recording medium 18 such as a DVD or CD-ROM via the auxiliary storage means 17 and stored in the storage means 13, can execute natural language analysis such as morphological analysis and syntactic analysis on document data expressed as character strings, based on dictionary information also stored in the storage means 13.
  • The accepting devices 4, 4, ... include at least a CPU 41 that controls the various hardware components, an internal bus 42 connecting them, storage means 43 composed of nonvolatile memory, a temporary storage area 44 composed of volatile memory, operation means 45 such as a mouse or keyboard, display means 46 such as a monitor, voice input/output means 47 such as a microphone and speaker, and communication means 48 for connection to the packet switching network 3.
  • The storage means 43 stores a processing program for the PC to operate as an accepting device 4.
  • The CPU 41 reads the processing program from the storage means 43 and executes it, controlling the various hardware components via the internal bus 42.
  • The temporary storage area 44 stores information temporarily generated by the arithmetic processing of the CPU 41.
  • the CPU 41 can detect a character string input operation from the user via the operation means 45 and store the input character string in the temporary storage area 44.
  • The CPU 41 can detect voice input from the user via the voice input/output means 47 and, by reading and executing the voice recognition program stored in the storage means 43, convert it into text data. The CPU 41 can also capture the voice input by the user through the voice input/output means 47 as voice data that can be processed by a computer.
  • the CPU 41 transmits text or voice word data obtained by detecting a character string input operation or voice input from the user to the sentence unit search device 1 via the communication means 48.
  • The CPU 41 may convert voice data into text data before transmission; it may also transmit features of the voice data obtained by voice recognition, for example the utterance speed of the phonemes corresponding to each word and the frequency of those phonemes.
  • The CPU 41 may also store the time stamps of the speech data corresponding to each word, and send to the sentence unit search device 1 the time elapsed since the word was last included in previously accepted words.
  • As preprocessing, the sentence unit search apparatus 1 first prepares the document set so that a group of meanings can later be represented for each sentence unit contained in each document data. Under "2. Document data acquisition and language analysis", the process by which the sentence unit search device 1 stores document data in the document storage means 2, separates each document data into sentence units of one or more sentences, analyzes the grammatical characteristics of each sentence, and stores the results in the document storage means 2 for each sentence unit is described. In the first embodiment, the case where the sentence unit search device 1 treats one sentence as one sentence unit is described.
  • the CPU 11 of the sentence unit search device 1 stores document data including the sentence unit to be searched in the document storage unit 2 in advance.
  • the CPU 11 of the sentence unit search device 1 acquires the document data that can be acquired via the communication unit 15 and the packet switching network 3 by Web crawling, and stores it in the document storage unit 2 via the document set connection unit 16.
  • The CPU 11 of the sentence unit search device 1 separates the document data acquired and stored in the document storage means 2, via the document set connection means 16, into sentence units, performs language analysis (morphological analysis and syntactic analysis), and stores the results in association with each sentence unit.
  • FIG. 3 is a flowchart showing the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 according to the first embodiment performs tagging and word extraction from the results of the morphological analysis and syntactic analysis of the acquired document data and stores them.
  • The processing shown in the flowchart of FIG. 3 corresponds to the processing of extracting the words that appear in each sentence unit or that are referred to from preceding sentence units, together with the features of each word in each sentence unit, and storing them.
  • The CPU 11 determines whether or not it has acquired document data (step S11). If the CPU 11 determines that it has not acquired document data (S11: NO), it returns the process to step S11 and waits until document data is acquired. When the CPU 11 determines that document data has been acquired (S11: YES), it attempts to read each sentence from the acquired document data and determines whether the reading succeeded (step S12).
  • From the results of the morphological analysis and syntactic analysis, the CPU 11 extracts the words that appear in the analyzed sentence and the words in the sentence that refer back to preceding sentences, and stores them in a list (step S14). Further, as described later, the CPU 11 generates tags from the analysis results (step S15), adds the tags to the read sentence, and stores it in the document storage means 2 via the document set connection means 16 (step S16).
•	the above processing is performed every time document data is acquired, and the tagged document data is stored in the document storage means 2.
  • FIG. 4 is an explanatory diagram showing an example of the contents of document data stored in the document storage means 2 in the first embodiment.
•	the document data stored in the document storage means 2 is HTML (HyperText Markup Language) and other text data that the CPU 11 of the sentence unit search apparatus 1 obtains, via the communication means 15, from publicly accessible Web servers connected to the packet switching network 3.
•	the example shown in Fig. 4 is an excerpt of HTML document data obtained from a web page published on the Internet (http://ja.wikipedia.org/wiki/...). In the following, this document example will be used to explain document analysis and retrieval.
•	in the sentence reading process of step S12 shown in the flowchart of FIG. 3, the CPU 11 of the sentence unit search device 1 sorts the character strings in the acquired document data into linguistic units of one sentence (sentence units). For example, the CPU 11 may sort on the character string representing the Japanese full stop "。" when the document data is composed of Japanese, or on the character string representing the period "." when the document data is composed of English.
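•	A minimal sketch of this sorting step is shown below (illustrative only; the function name and the delimiter handling are assumptions, not part of the embodiment):

```python
import re

def split_into_sentence_units(text: str, lang: str = "ja") -> list[str]:
    """Sort a character string into sentence units on the full stop:
    "。" for Japanese document data, "." for English document data."""
    delimiter = "。" if lang == "ja" else "."
    # Split after each delimiter, keeping it attached to its sentence.
    parts = re.split(f"(?<={re.escape(delimiter)})", text)
    return [p.strip() for p in parts if p.strip()]

print(split_into_sentence_units("九州地方北部では祭りが行われる。秋に行われる。"))
print(split_into_sentence_units("A festival is a ritual. It is held in autumn.", lang="en"))
```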
•	the CPU 11 of the sentence unit search device 1 performs morphological analysis on the linguistic unit of one sentence based on dictionary information, identifies the morphemes that are the minimum constituent units of the sentence, and analyzes the morpheme structure. For example, in the document data shown in FIG. 4, morphemes are identified by collating the character string against dictionary entries such as nouns like "Festival" and "God Spirit", proper nouns like "Kyushu", verbs like "speak", particles like "to" and "ha", and symbols such as "," and ".".
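•	The following toy longest-match tokenizer sketches this collation against dictionary information (the dictionary entries are assumptions for illustration; real analyzers such as ChaSen use large dictionaries and statistical models):

```python
# Toy dictionary: surface form -> part of speech (assumed entries).
DICTIONARY = {
    "九州": "proper noun",
    "地方": "noun",
    "北部": "noun",
    "で": "particle",
    "は": "particle",
    "、": "symbol",
    "祭り": "noun",
}

def tokenize(sentence: str) -> list[tuple[str, str]]:
    """Identify morphemes by greedy longest match against the dictionary."""
    morphemes, i = [], 0
    while i < len(sentence):
        for j in range(len(sentence), i, -1):  # try the longest candidate first
            surface = sentence[i:j]
            if surface in DICTIONARY:
                morphemes.append((surface, DICTIONARY[surface]))
                i = j
                break
        else:  # no dictionary entry: emit a single character as unknown
            morphemes.append((sentence[i], "unknown"))
            i += 1
    return morphemes

print(tokenize("九州地方北部では、祭り"))
```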
  • Various techniques for morphological analysis have been proposed today, and the present invention does not limit the morphological analysis techniques.
•	next, using the part-of-speech information (noun, particle, adjective, verb, adverb, etc.) of each identified morpheme, the CPU 11 of the sentence unit search device 1 performs syntactic analysis to extract the grammatical relationships between morphemes, based on grammatical information that statistically captures the cohesiveness between parts of speech under Japanese grammar for a Japanese sentence or English grammar for an English sentence. For example, by applying the grammar as a tree structure, the relationships between morphemes can be extracted according to that tree structure.
•	for example, suppose the analysis target is (adjective + noun + particle + noun). First, it is determined whether the analysis target applies to (adjective + noun), that is, whether the first morpheme of the analysis target is an adjective. When the first morpheme is determined to be an adjective, it is determined that the adjective is the outermost modifier in the analysis target, modifying the noun that follows; in other words, the relationship (adjective + (noun)) is extracted. Next, the remaining analysis target is checked against (noun); if it consists of multiple morphemes and is not a single noun, it is determined whether the remaining analysis target applies to (adjective + noun), that is, whether its first morpheme is an adjective. If the first morpheme of the remaining analysis target is not an adjective, the adjective part of (adjective + noun) is expanded to (noun + particle), and it is determined whether the remaining analysis target applies to ((noun + particle) + noun). In this way, the grammatical relationship between the morphemes of the analysis target (adjective + noun + particle + noun) is extracted as [adjective + {(noun + particle) + noun}].
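•	The recursive pattern check described above can be sketched as follows (illustrative only; actual syntactic analyzers apply full grammars rather than this toy procedure):

```python
def parse(pos_seq: list[str]):
    """Bracket a part-of-speech sequence such as
    [adjective, noun, particle, noun] into nested dependency relations."""
    if len(pos_seq) == 1:
        return pos_seq[0]
    if pos_seq[0] == "adjective":
        # The adjective is the outermost modifier of everything that follows.
        return ("adjective", parse(pos_seq[1:]))
    if len(pos_seq) >= 3 and pos_seq[0] == "noun" and pos_seq[1] == "particle":
        # Expand the modifier slot to (noun + particle).
        return (("noun", "particle"), parse(pos_seq[2:]))
    return tuple(pos_seq)

# Yields the bracketing [adjective + {(noun + particle) + noun}]:
print(parse(["adjective", "noun", "particle", "noun"]))
```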
•	the method of syntactic analysis is not limited to such a method; as with morphological analysis, various methods have been proposed today, and the present invention does not limit the method of syntactic analysis.
•	the CPU 11 of the sentence unit search device 1 generates document data in which the analyzed morphemes and the grammatical relationships between the morphemes are represented by tags based on XML (eXtensible Markup Language), and stores it in the document storage means 2. That is, the input character string is morphologically analyzed and further syntactically analyzed, and is sorted into morphemes, each annotated with its part-of-speech information and other morpheme information.
  • the control program 1P stored in the storage means 13 of the sentence unit retrieval apparatus 1 is configured to allow the CPU 11 of the sentence unit retrieval apparatus 1 to execute the natural language analysis method.
•	for example, phrase number 0 is analyzed as (0: Kyushu (noun + proper noun + region + general, Kyushu, Kyuushuu) / region (noun + general, region, chihou) / north (noun + general, region, northern) / de (particle + case particle + general, de) / wa (particle + topic particle, wa) / , (symbol + punctuation)); in this way the morphemes are identified and information is added to them.
•	for example, the morpheme "Kyushu" is a noun, a proper noun, and a noun indicating a region, and is sometimes used as a general noun; its basic form is "Kyushu", and its pronunciation can be determined to be "Kyuushuu".
•	the dependency information is obtained, for example, as (0 2, 1 2, 2 -1), so that the dependency relationships between phrases can be discriminated: the phrase with phrase number 0 depends on the phrase with phrase number 2, the phrase with phrase number 1 depends on the phrase with phrase number 2, and the phrase with phrase number 2 can be identified as having no dependency destination because its destination is -1.
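•	As a sketch, this dependency information can be held in a simple mapping (the names are illustrative, not part of the embodiment):

```python
# (0 2, 1 2, 2 -1): phrase number -> dependency destination, -1 = head.
dependencies = {0: 2, 1: 2, 2: -1}

def head_phrase(deps: dict[int, int]) -> int:
    """Return the phrase whose dependency destination is -1 (the head)."""
    return next(n for n, dest in deps.items() if dest == -1)

for phrase, dest in dependencies.items():
    if dest == -1:
        print(f"phrase {phrase}: head (no dependency destination)")
    else:
        print(f"phrase {phrase} depends on phrase {dest}")
```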
•	FIG. 5 is an explanatory diagram showing an example of document data to which the CPU 11 of the sentence unit search apparatus 1 according to Embodiment 1 has added the results of morphological analysis and syntactic analysis and which it has stored in the document storage means 2. This corresponds to an example of the document data stored in the document storage means 2 by executing the processing procedure shown in the flowchart of FIG. 3 on the document data having the contents shown in FIG. 4.
•	in FIG. 5, the CPU 11 of the sentence unit search apparatus 1 has sorted a part of the document shown in FIG. 4 into morphemes such as proper nouns, nouns, particles, and verbs, and the relationships between them are expressed by nesting tags.
•	the example shown in FIG. 5 is based on the tagging method according to the rules proposed by GDA (Global Document Annotation; see http://i_content.org/gda). However, the present invention is not limited to complying with these rules; as long as the computer can identify the morpheme information and the dependency relationships between morphemes by information processing, the method is not limited to XML tagging.
•	the tag indicated by <su> is a tag representing a sentence unit (Sentential Unit).
•	for example, it can be identified by the tags that the sentence "In the northern part of the Kyushu region, what happens in the fall is sometimes called (O)kunchi" has three clause units: "in the northern part of the Kyushu region", "is called (O)kunchi for what happens in the fall", and "there is".
•	the tag indicated by <ad> is a tag that indicates a particle other than a final particle, an adverb, an adjunct, and the like. The tag indicated by <n> indicates a noun, and the tag indicated by <v> indicates a verb.
  • the attribute represented by the attribute name syn indicates a dependency relationship between language units such as clauses or words sandwiched between tags to which the attribute is assigned.
•	when the attribute value f (forward) is assigned, it means that the linguistic unit constituting the sentence depends on the nearest subsequent linguistic unit. Therefore, in principle, clause 0 "in the northern part of the Kyushu region" relates to clause 1 "is called (O)kunchi for what happens in the fall", and clause 1 in turn relates to "there is" in clause 2.
•	by using <np> instead of the tag indicated by <n>, it can be shown that the unit is not a word on the side that receives the dependency.
•	"Northern part of the Kyushu region" can be sorted into the morphemes "Kyushu", "region", and "north", each sandwiched between <n> tags, because "Kyushu" relates to "region" and "region" relates to "north".
•	likewise, for "events (happenings), festivals", the tags can show the dependency relationship in which "events (happenings)" relates to the particle "no" rather than directly to "festival".
•	a proper noun representing a place, such as "Kyushu", or a proper noun representing a person's name, such as "Taro", can be indicated by a <placename> or <pername> tag, respectively.
  • a morpheme referenced from a preceding word or sentence such as a demonstrative pronoun or a zero pronoun can be expressed using an attribute indicating an anaphoric relationship.
•	the attribute name id can be used to indicate which preceding word or sentence a demonstrative pronoun or zero pronoun refers to. For example, for the sentences "There is a button on the right side. Please press it.", a human reader can naturally supply that "it" refers to the "button". However, when they are processed by a computer, although "it" can be identified as a demonstrative pronoun by checking against dictionary information, what "it" refers to cannot be determined.
  • the corresponding relationship can be indicated by the id attribute, the eq attribute, and the obj attribute described above.
•	for example, by tagging the first sentence as "There is a <np id="btn">button</np> on the right side.", marking "it" in the second sentence with <np eq="btn">, and marking the verb "press" in the third sentence with obj="btn", it can be indicated that "it" in the second sentence refers to the "button" and that the object of "press" in the third sentence is the "button".
•	information indicating the result of the morphological analysis is added, under the attribute name mph, to the attribute information of the tags such as <n>, <ad>, and <v> that sandwich each morpheme. The attribute value indicates the part-of-speech information, basic-form information, pronunciation information, etc. of the morpheme obtained by the morphological analysis.
•	additional information, part-of-speech information, inflected-form information, basic-form information, and pronunciation information are given as attribute values in the form mph="additional information; part-of-speech information; inflected-form information; basic-form information; pronunciation information".
•	for example, "Kyushu" can be classified by the part-of-speech information noun + proper noun + region + general; its basic form is Kyushu, and its pronunciation "Kyuushuu" is clearly indicated by the mph attribute.
•	identification information from the morphological analyzer, such as a ChaSen identifier, is added as additional information of the morpheme.
•	in this way, the CPU 11 of the sentence unit search apparatus 1 tags the document data obtained by Web crawling with the results of the morphological analysis and syntactic analysis according to the GDA rules, and stores the resulting XML data in the document storage means 2 via the document set connection means 16.
•	thereafter, the CPU 11 of the sentence unit search apparatus 1 can identify the tags of the document data by character string analysis and, by identifying the attribute information attached to each tag, can identify the morpheme information and grammatical relationships.
  • FIG. 6 is an explanatory diagram illustrating an example of a list of extracted words for all document data acquired by the CPU 11 of the sentence unit search device 1 according to the first embodiment.
•	in this example, 31245 words are listed. It should be noted that overly common words such as "thing" are excluded from the stored words. This is because such a word, like a conjunction or an article, is too general: although it appears frequently, the word itself carries little meaning, so it burdens the search processing and is inappropriate as a search target.
  • the CPU 11 of the sentence unit search device 1 specifies information that quantitatively represents a group of meanings of the sentence for each sentence in the document data stored in the document storage unit 2.
•	information that quantitatively expresses a group of meanings of a sentence is expressed as the group of words that the user is paying attention to when using the sentence (speaking, writing, listening, or reading), together with a value (word weight value) that quantitatively indicates the degree of attention, that is, the salience, of each word.
•	the salience of each word in a sentence could also be quantified by the appearance frequency used by conventional search services. However, the appearance frequency is obtained over a document or the entire document set. Therefore, while calculating the appearance frequency of each word for each document can quantitatively represent the meaning of the document as a whole, it cannot represent a group of meanings that reflects a context changing dynamically with the flow within the document.
•	the salience of a word in a sentence can be expressed grammatically by the degree of attention the word received in the preceding sentence and by the transition of that degree of attention in the current sentence, depending on how the word is used. In other words, if a word that was the topic (subject) in the preceding sentence is also the topic (subject) in the current sentence, that word is the most noticeable word in the current sentence and has high salience. On the other hand, a word that did not appear in the preceding sentence but is the topic (subject) of the current sentence is attracting attention in the current sentence, but compared with a word that continues to be used as the topic as described above, its salience can be said to be low.
•	such transitions of salience have been studied as Centering Theory (Grosz et al., 1995; Nariyama, 2002; Poesio et al., 2004).
•	in Centering Theory, however, the salience of each word is not represented as a feature value that can be calculated quantitatively by a computer or the like; it is only possible to determine to which of the transitions defined by the theory the transition of each word belongs. Therefore, the present invention calculates the salience of each word in each sentence quantitatively.
•	specifically, a reference probability is calculated for each word in each sentence, and the calculated reference probability is assigned as a weight value representing the salience of each word in each sentence.
•	the reference probability that a word appears or is referenced in a subsequent sentence is calculated not from the meaning of the word, which is difficult to handle quantitatively, but from a feature pattern of how the word appears or is referenced, which the sentence unit search device 1 can analyze by information processing: the feature pattern of the word is identified, and the proportion of words having the same feature pattern that actually appear or are referenced in a subsequent sentence is calculated as the reference probability.
•	the reference probability of each word is defined as that word's weight value, and the set of words in a given sentence to which these weight values have been assigned is called a weighted word group. A group of meanings for each sentence unit can thus be expressed by a weighted word group to which quantitative weight values called reference probabilities are given.
•	if a sufficient number of occurrences of the same feature pattern as the specified feature pattern is available, the reference probability can be calculated statistically without difficulty, as the ratio of occurrences of the same feature pattern in which the word actually appears or is referenced in the subsequent sentence. In practice, however, the number of identical feature patterns is limited, and an enormous amount of document data would be required to calculate reliable reference probabilities. Therefore, a regression equation that predicts, from the feature pattern of a word, whether or not the word appears or is referenced in the subsequent sentence is obtained by learning a regression model on the feature patterns and the events of actually appearing or being referenced in the subsequent sentence.
•	sentences in the document data stored in the document storage means 2 are sandwiched between the tags indicated by <su>, and the words that appear in a sentence, or that have an anaphoric relationship with a pronoun or zero pronoun in the sentence, can be specified from the tag attribute information. Therefore, in the sentence unit search device 1 of the present invention, the feature pattern is specified as follows for the document data stored in the document storage means 2.
  • a sample (s, w) is a pair of one sentence s in the document data and a word w included in a sentence preceding the one sentence in the document data.
  • the feature pattern f (s, w) for the sample is specified by the following feature amount.
•	examples are the feature quantity of the distance (dist) from the sentence in which the word w most recently appeared or was referenced to the sentence s, the feature quantity of the grammatical role (gram) of the word w when it most recently appeared or was referenced, and the feature quantity of the number (chain) of sentences preceding the sentence s in which the word w appears or is referenced.
•	the feature quantities are not limited to these, and may include, for example, whether or not the word w is a word indicating a recent topic, or whether or not the word w denotes a person.
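•	The following sketch computes dist, gram, and chain from a simplified history (the sentence representation and role codes are assumptions for illustration; the embodiment derives them from the GDA tags):

```python
# Each sentence is a list of (word, grammatical_role) pairs; the role codes
# are arbitrary here (e.g. 2 = topic/subject, 1 = other).
Sentence = list[tuple[str, int]]

def feature_pattern(history: list[Sentence], word: str) -> dict[str, int]:
    dist = gram = chain = 0
    for back, sentence in enumerate(reversed(history), start=1):
        roles = [role for w, role in sentence if w == word]
        if roles and dist == 0:
            dist = back        # distance to the most recent appearance
            gram = max(roles)  # role code at that most recent appearance
        if roles:
            chain += 1         # number of preceding sentences containing w
    return {"dist": dist, "gram": gram, "chain": chain}

history = [
    [("Taro", 2), ("school", 1)],  # Taro appears as the topic
    [("Taro", 1), ("book", 1)],
]
print(feature_pattern(history, "Taro"))  # {'dist': 1, 'gram': 1, 'chain': 2}
```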
•	since the results of the morphological analysis and syntactic analysis are described by tags conforming to GDA, character string analysis of the document data makes it possible to delimit and count sentences by the <su> tag, to identify particles from the part-of-speech information indicated by the tags within each sentence, and to count the occurrences of words, including those referred to by demonstrative pronouns or zero pronouns. Therefore, the CPU 11 of the sentence unit search device 1 can specify the feature quantities dist, gram, and chain for each sample by analyzing the tags and their attribute values according to GDA.
  • the CPU 11 of the sentence unit search device 1 extracts a sample from the tagged document data stored in the document storage means 2, and obtains a feature amount from the extracted sample to identify a feature pattern.
•	the processing procedure for estimating, by regression analysis, the regression equation used to calculate the reference probability from the feature patterns of the extracted samples is also described.
•	FIG. 7 is a flowchart showing the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 in Embodiment 1 extracts samples from the tagged document data stored in the document storage means 2 and performs a regression analysis to estimate the regression equation for calculating reference probabilities.
•	the process shown in the flowchart of FIG. 7 corresponds to the process of identifying a feature pattern for each word in each sentence unit, determining whether or not the identified word appears or is referenced in the subsequent sentence unit, and performing regression learning based on these results so that the reference probability can be calculated.
  • the CPU 11 of the sentence unit search device 1 acquires tagged document data from the document storage means 2 via the document set connection means 16 (step S21).
•	the CPU 11 identifies the <su> tags added to the acquired document data by character string analysis and sorts it into sentences (step S22).
•	the CPU 11 identifies each tag within the <su> tags indicating a sentence by character string analysis, and extracts samples by associating the words that appear or are referenced in the sentence with the sentence (step S23).
•	for each extracted sample, the tags are identified by character string analysis, and the feature pattern consisting of dist, gram, and chain is specified (step S24).
•	the CPU 11 determines whether or not the sorted sentence is the last in the acquired document data (step S25); if the CPU 11 determines that it is not the last (S25: NO), the CPU 11 returns the process to step S22 and continues sorting by identifying the <su> tag of the subsequent sentence. Whether the sorted sentence is the last in the acquired document data can be determined, for example, by whether another <su> tag follows the <su></su> pair enclosing the currently sorted sentence; if it is determined that none follows, the sentence can be determined to be the last.
•	next, the CPU 11 determines whether or not the extraction of a predetermined number of samples is completed (step S26). If the CPU 11 determines that sample extraction is not complete (S26: NO), the CPU 11 returns the process to step S21, acquires different tagged document data, and continues the sample extraction. If the CPU 11 determines that the sample extraction is completed (S26: YES), the CPU 11 performs a regression analysis on the extracted samples, estimates the regression coefficient for each of the feature quantities dist, gram, and chain (step S27), and ends the process.
  • FIG. 8 is an explanatory diagram showing an example of a feature pattern identified by a sentence in the document data stored in the document storage unit 2 according to the first embodiment.
•	the feature pattern f(si, Taro-kun) of the sample (si, Taro-kun), formed from the sentence si shown in FIG. 8 and the word "Taro-kun" appearing in the preceding sentences, is identified as follows.
•	for example, the feature quantity of the distance (dist) is determined between the current sentence si and the sentence sj in which the word "Taro-kun" most recently appeared or was referenced among the preceding sentences; here si comes immediately after sj.
•	in step S27 shown in the flowchart of FIG. 7, the regression analysis is performed based on a logistic regression model. However, the regression analysis is not limited to this, and other regression analysis methods, such as a kNN (k-Nearest Neighbors) smoothing + SVR (Support Vector Regression) model, may be used.
•	for example, when the kNN smoothing + SVR model is used, the regression model can be learned using the following eight elements as the feature quantities of the feature pattern: in addition to dist, gram, and chain described above, the following elements can be handled as feature quantities.
•	One may be the type of noun (exp, pronoun: 1 / non-pronoun: 0) when the word w is referenced within the immediately preceding sentence unit.
•	Another may be whether or not the word w is the topic when it appears or is referenced in the immediately preceding sentence unit (last-topic, yes: 1 / no: 0).
•	Another may be whether or not the word w is the subject when it appears or is referenced in the immediately preceding sentence unit (last-sbj, yes: 1 / no: 0).
•	Another may be whether or not the word w denotes a person in the sample (s, w) (pi, yes: 1 / no: 0).
•	Another may be the part-of-speech information (pos, noun: 1, verb: 2, etc.) of the word w in the immediately preceding sentence unit where the word w appears or is referenced.
•	Another may be whether or not the word w is referenced in the title or a heading in the document (in_header, yes: 1 / no: 0).
•	furthermore, when speech is handled, the elapsed time since the most recent reference location of the word (time-dist), the speaking speed per syllable of the phrase containing the most recent reference of the word, as a ratio to the speaker's average (syllable-speed), and the frequency ratio of the lowest utterance pitch to the highest utterance pitch of the phrase including the reference part closest to the word (pitch-fluct), or any one or more of these, can be used as feature quantities.
•	if the regression analysis is performed including such feature quantities of the voice data, the CPU 11 of the sentence unit search device 1 can also calculate the reference probability from these feature quantities when it receives voice data as word input, as will be described later. As described above, when the kNN smoothing + SVR model is used, the reference probability can be calculated based on more detailed feature quantities, and a more precise reference probability can be obtained.
•	whether or not the word w actually appears or is referenced in the sentence si+1 following the sentence si is recorded for each sample, and regression analysis with the logistic regression model is performed on all samples (si, w). As a result, a regression equation is obtained that calculates, when the feature quantities dist, gram, and chain are given, the probability Pr(si+1, w) that the word w appears or is referenced in si+1.
•	the probability given by the logistic regression model for the explanatory variables (feature quantities) x1, x2, ..., xn is generally obtained by the following equation (1):

  P = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + ... + bn*xn)))  (1)

•	in the regression analysis of the reference probability of the word w in the sentence si performed by the present invention, the explained variable is set to 0 for a sample in which the word does not appear and is not referenced in the subsequent sentence si+1, and to 1 for a sample in which it does appear or is referenced; the explanatory variables are the feature quantities dist, gram, and chain; and the extracted samples are learned to estimate the parameters (regression coefficients) b0, b1, b2, and b3 in the following equation (2):

  P = 1 / (1 + exp(-(b0 + b1*dist + b2*gram + b3*chain)))  (2)

•	equation (3), in which the estimated parameters are applied, is the regression equation for obtaining the reference probability:

  Pr(si+1, w) = 1 / (1 + exp(-(b0 + b1*dist(si, w) + b2*gram(si, w) + b3*chain(si, w))))  (3)
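•	A sketch of this estimation and of evaluating equation (3) is shown below (the toy samples and the use of scikit-learn's LogisticRegression are assumptions for illustration, not part of the embodiment):

```python
import math
from sklearn.linear_model import LogisticRegression

# Toy samples: feature pattern (dist, gram, chain) and whether the word
# actually appeared or was referenced in the following sentence (1) or not (0).
X = [[1, 2, 3], [4, 0, 1], [1, 1, 2], [6, 0, 0], [2, 2, 4], [5, 0, 1]]
y = [1, 0, 1, 0, 1, 0]

model = LogisticRegression().fit(X, y)  # estimates b0..b3 (cf. step S27)
b0, (b1, b2, b3) = model.intercept_[0], model.coef_[0]

def reference_probability(dist: float, gram: float, chain: float) -> float:
    """Equation (3): Pr = 1 / (1 + exp(-(b0 + b1*dist + b2*gram + b3*chain)))."""
    z = b0 + b1 * dist + b2 * gram + b3 * chain
    return 1.0 / (1.0 + math.exp(-z))

print(reference_probability(dist=1, gram=2, chain=3))
```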
•	the estimated parameters differ depending on whether the document data stored in the document storage means 2 consists only of newspaper articles, which are written language, or only of utterances, which are spoken language, converted into document data. The estimated parameter values b0, b1, b2, and b3 also differ depending on the amount and the content of the document data.
•	therefore, the document data is stored separately for written language and spoken language, parameters are estimated by regression analysis for the spoken-language document data as well, and the regression equations for calculating the reference probability are stored for each. If the words accepted by the accepting devices 4, 4, ... are limited to written text entered by text input rather than speech, the document storage means 2 may store the document data without distinguishing spoken language from written language.
•	in this way, by specifying the feature pattern consisting of the feature quantities dist, gram, and chain of each word in a sentence unit, the CPU 11 of the sentence unit search device 1 can calculate the reference probability of the word having that feature pattern. Therefore, the CPU 11 of the sentence unit search device 1 acquires the tagged document data stored in the document storage means 2, sorts it by sentence, specifies a feature pattern for each word that appears or is referenced in each sentence, and calculates its reference probability. As a result, it is possible to quantitatively represent a group of meanings for each sentence that reflects the contextual meaning of the preceding sentences.
•	the CPU 11 of the sentence unit search device 1 acquires the document data stored in the document storage means 2 and, for each sentence included in the document data, identifies the grammatical feature pattern of each word over that sentence and the preceding sentences, calculates the reference probability of each word for each sentence based on the identified feature patterns and the regression equation, and stores the results in advance.
•	the CPU 11 of the sentence unit search apparatus 1 stores the set of each word and its reference probability (the weighted word group) in association with each sentence unit. That is, the CPU 11 performs this storing process for all the sentences of all the documents acquired from the document set. In the later search process, on the other hand, the CPU 11 extracts, from all the sentences of all the documents, the sentences whose contextual meaning is similar to the accepted words. Reading out all the sentences of all the documents one by one, together with the weighted word group representing the contextual meaning associated with each, would therefore impose a heavy processing load.
•	therefore, so that the CPU 11 of the sentence unit search apparatus 1 does not have to read out the weighted word group representing the contextual meaning of each sentence one by one in the subsequent processing, the weighted word group calculated for each sentence is stored in a database and indexed.
•	FIG. 9 and FIG. 10 are flowcharts showing the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 in Embodiment 1 calculates the reference probability of each word for each sentence of the tagged document data stored in the document storage means 2 and stores the results.
•	the process shown in the flowcharts of FIGS. 9 and 10 corresponds to the process of calculating, for each sentence unit, the reference probability of each word using the feature pattern identified for the word and the regression coefficients corresponding to the feature pattern, and storing the calculated reference probabilities in pairs with the words.
•	the CPU 11 of the sentence unit search device 1 acquires the tagged document data from the document storage means 2 via the document set connection means 16 (step S301).
•	the CPU 11 identifies the <su> tags added to the acquired document data by character string analysis and sorts it into sentences (step S302).
•	the CPU 11 identifies each tag within the <su> tags indicating the sentence by character string analysis and extracts the words that appear or are referenced in the sentence (step S303); while the reference probabilities are being calculated for the document data, the extracted words are stored in the temporary storage area 14 (step S304).
•	for the words of the document data containing the sentence that are stored in the temporary storage area 14, the CPU 11 identifies the tags added to each word by character string analysis, and identifies the feature pattern consisting of dist, gram, and chain (step S305). Next, the CPU 11 calculates the reference probability by substituting each feature quantity of the identified feature pattern into equation (3) (step S306).
•	the CPU 11 determines whether or not the reference probability of each word for the sentence has been calculated for all the words stored in the temporary storage area 14 (step S307). If the CPU 11 determines that the reference probabilities have not been calculated for all the words (S307: NO), the CPU 11 returns the process to step S305 and continues identifying feature patterns and calculating reference probabilities for the other words. On the other hand, if the CPU 11 determines that the reference probabilities have been calculated for all the words (S307: YES), the CPU 11 stores the set of the words stored in the temporary storage area 14 and the reference probability calculated for each word (the weighted word group) with the salience attribute added (step S308). At this time, the CPU 11 narrows the words down by a predetermined value and does not store words having a reference probability less than the predetermined value.
•	next, the CPU 11 performs indexing so that the set of words and per-word reference probabilities (the weighted word group) attached to the current sentence can be extracted later, and stores it in the weighted word group database (step S309).
  • the CPU 11 may store the database in the storage unit 13 or may store it in the document storage unit 2 via the document set connection unit 16.
  • the CPU 11 executes the following process as one of the indexing processes.
  • the CPU 11 pays attention to the reference probability of one word in the weighted word group obtained in step S308, and determines whether or not the reference probability of the one word is greater than or equal to a predetermined value. Next, the CPU 11 determines whether or not the reference probability of another word in the weighted word group is a predetermined value or more.
•	that is, the CPU 11 first divides the weighted word groups into a group in which the reference probability of the one word is greater than or equal to the predetermined value and a group in which it is less than the predetermined value, and determines to which group the calculated weighted word group belongs; if it belongs to the group in which the reference probability of the one word is greater than or equal to the predetermined value, the CPU 11 then determines whether it belongs to the subgroup in which the reference probability of another word is greater than or equal to a predetermined value or to the subgroup in which that reference probability is less than the predetermined value. By repeating such processing, the CPU 11 determines to which group the calculated weighted word group belongs, and stores it in association with the identification information of that group. For example, a k-d tree search algorithm can be applied to this indexing process.
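•	The grouping can be sketched with a k-d tree over vectors built from the weighted word groups (a sketch only; the tiny vocabulary and the use of scipy's KDTree are assumptions, and a real index over the full word list would use a sparse representation):

```python
from scipy.spatial import KDTree

# Word identification number -> dimension index (assumed tiny vocabulary).
vocab = {9714: 0, 9716: 1, 120: 2}

def to_vector(weighted_words: dict[int, float]) -> list[float]:
    v = [0.0] * len(vocab)
    for word_id, prob in weighted_words.items():
        v[vocab[word_id]] = prob
    return v

stored = [
    {9714: 0.238, 9716: 0.159},  # weighted word group of sentence unit A
    {120: 0.4},                  # ... of sentence unit B
    {9714: 0.22, 120: 0.05},     # ... of sentence unit C
]
tree = KDTree([to_vector(g) for g in stored])

query = {9714: 0.25, 9716: 0.1}  # weighted word group of the accepted words
dist, idx = tree.query(to_vector(query), k=2)
print("nearest sentence units:", idx, "distances:", dist)
```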
•	the CPU 11 determines whether or not the process of associating a weighted word group with each sentence has been completed for all sentences in the document data acquired in step S301 (step S310).
•	the CPU 11 determines whether the process has been completed for all sentences in the document data as follows: for example, it determines whether another <su> tag follows the <su></su> pair enclosing the current sentence, and if it is determined that none follows, the current sentence can be determined to be the last. If the CPU 11 determines that the process of associating a weighted word group with each sentence is not completed for all sentences in the document data acquired in step S301 (S310: NO), the CPU 11 returns the process to step S302 and continues the processing for the next sentence.
•	if the CPU 11 determines that the process of associating a weighted word group with each sentence is completed for all sentences in the document data acquired in step S301 (S310: YES), the CPU 11 deletes the words extracted from that document data and stored in the temporary storage area 14 (step S311).
•	the CPU 11 determines whether or not the process of storing the words and their reference probabilities with the salience attribute has been completed for all document data (step S312). If the CPU 11 determines that it has not been completed for all document data (S312: NO), the CPU 11 returns the process to step S301, acquires the next document data, and continues the processing. If the CPU 11 determines that the process has been completed for all document data (S312: YES), the CPU 11 ends the process of calculating the word reference probabilities and storing them in advance.
  • FIG. 11 is an explanatory diagram showing an example in which the CPU 11 of the sentence unit search apparatus 1 according to the first embodiment classifies the document shown in the document data for each sentence.
•	the CPU 11 of the sentence unit search device 1 identifies the <su> tags in the document data stored in the document storage means 2 and sorts it into sentences by the processing of steps S301 and S302.
•	the sorted sentences are, for example, s1 "A festival is a ritual that enshrines spirits, etc." and s2 "Festival, ritual ...".
•	the words extracted from the sentences s1, s2, and s3 by the processing of step S303 by the CPU 11 of the sentence unit retrieval apparatus 1 are the words stored in the word list.
•	by the process of step S305, the CPU 11 of the sentence unit search apparatus 1 specifies, for each word in the sentence s, the feature pattern consisting of the feature quantities dist, gram, and chain. For example, for "Kyushu" (identification number: 9714; see FIG. 6) in the sentence s, the feature pattern is specified as follows.
•	the CPU 11 of the sentence unit search device 1 calculates the reference probability by substituting the values of the feature quantities dist, gram, and chain into equation (3) by the process of step S306 in the flowcharts of FIGS. 9 and 10. As shown in equation (4), the reference probability of "Kyushu" in the sentence s is calculated as 0.238, and this reference probability is stored for the sentence s.
  • the word is represented by an identification number stored in a list, and the reference probability is stored in association with it.
•	specifically, the attribute name salience is defined for the <su> tag that delimits each sentence unit, and its attribute value, a list of word identification numbers and reference probabilities, stores the words and their reference probabilities (the weighted word group).
  • FIG. 12 is an explanatory diagram showing an example of document data that the CPU 11 of the sentence unit search device 1 according to the first embodiment gives the result of calculating the reference probability and stores the result in the document storage unit 2.
•	in FIG. 12, the reference probability (weight value) of "Kyushu" (9714) in the sentence s is stored as 0.238, the reference probability of "North Kyushu" (9716) as 0.1159, and so on.
•	FIG. 13 is an explanatory diagram showing an example of the contents of the database when the CPU 11 of the sentence unit search device 1 according to Embodiment 1 indexes and stores the weighted word groups calculated for each sentence unit.
•	the content example in FIG. 13 corresponds to the weighted word group associated with the sentence s shown in the content example of FIG. 12.
•	the CPU 11 stores the weighted word group in association with information (a k-d tree node ID) indicating to which group it belongs. Further, at that time, the CPU 11 records the file name of the tagged document data and the position within the document data so that it can be identified with which sentence unit of which document data the weighted word group is associated.
•	this makes it easy to extract the sentence units associated with weighted word groups similar to the weighted word group obtained for the words received in the later processing.
•	FIG. 14 shows how the set of words stored for each sentence by the CPU 11 of the sentence unit search apparatus 1 and the reference probabilities calculated for those words change as the sentences continue. In FIG. 14, the context continues in time series as sentences s1, s2, s3, and s4 follow one another.
•	the search process starts from the reception of words, such as keywords or speech, input via the receiving devices 4, 4, ....
•	the CPU 41 of the accepting device 4 detects a character string input by the user via the operation means 45 and stores it in the temporary storage area 44, or detects a voice input by the user via the voice input/output means 47, converts it into a character string, and stores it in the temporary storage area 44.
•	the CPU 41 of the accepting device 4 has a function of analyzing a character string input by the user and separating it sentence by sentence. For example, a predetermined character, such as the full stop "。" in Japanese or the period "." in English, may be identified for the separation. Alternatively, each time a press of the Enter key is detected via the operation means 45, the character string entered up to that point may be separated as one sentence. Voice may also be converted into a character string by the voice recognition function and separated into sentences by analysis of the converted character string.
•	the CPU 41 of the accepting device 4 transmits the separated sentences as text data to the sentence unit retrieval device 1 via the communication means 48.
•	next, the processing performed when the CPU 11 of the sentence unit search device 1 receives text data indicating the words accepted by the accepting devices 4, 4, ... and searches for sentences in the documents stored in the document storage means 2 will be described.
•	for the text data indicating the accepted words, quantification of the group of meanings is performed, that is, word extraction from the text data and calculation of the word reference probabilities. In this way, information indicating a group of meanings reflecting the context, corresponding to the flow from the preceding words in the user's latent consciousness when inputting the words, can be created automatically and used as a search request in the search processing described later.
•	the temporary storage area 14 stores the text data in the order received, and morphological analysis and syntactic analysis are performed on the sentences indicated by the received text data.
•	the CPU 11 of the sentence unit retrieval apparatus 1 identifies the feature pattern f(s, w) of each word w in the sentence s of the received text data, and calculates the reference probability based on the identified feature pattern and the previously obtained regression equation.
•	the CPU 11 of the sentence unit search device 1 calculates a reference probability for each word and compares the set of the words and the reference probability calculated for each word with the weighted word groups already stored in association with the sentence units; in other words, a sentence-by-sentence search is performed by this comparison.
•	the CPU 11 of the sentence unit search device 1 can receive from the reception devices 4, 4, ... not only text data but also speech data of utterances input by the user. In this case, the same processing is performed by specifying the grammatical feature patterns of the words expressed in the voice data, as with text data.
•	for speech data, it is also possible to treat features obtained from the speech itself as features for determining whether or not a word is highly salient. For example, when a word appears or is referenced, the CPU 11 can treat the time difference from the preceding appearance or reference of the word as one feature quantity. Further, the CPU 11 can treat the speech speed and/or the speech frequency at the most recent preceding location where the word appeared or was referenced as other feature quantities.
•	the processing procedure by which the accepting device 4 accepts words input by the user and sends them to the sentence unit retrieval device 1, and by which the CPU 11 of the sentence unit retrieval device 1 searches the document data stored in the document storage means 2 based on the text data received from the accepting device 4, will be described with reference to flowcharts.
  • FIG. 15, FIG. 16, and FIG. 17 are flowcharts showing the processing procedure of the search processing of the sentence unit search device 1 and the reception device 4 in the first embodiment.
•	the CPU 41 of the accepting device 4 determines whether a character string input operation by the user has been detected via the operation means 45, or whether a voice input by the user has been detected via the voice input/output means 47 (step S401). If the CPU 41 determines that no character string input operation or voice input by the user has been detected (S401: NO), the CPU 41 returns the process to step S401 and waits until a character string input operation or voice input by the user is detected. On the other hand, if the CPU 41 of the receiving apparatus 4 determines that a character string input operation or a voice input by the user has been detected (S401: YES), the CPU 41 separates the input words into single sentences from the input character string or from the character string converted from the voice input and stores them in the temporary storage area 44 (step S402), and also transmits the input words to the sentence unit search device 1 via the packet switching network 3 (step S403).
•	the CPU 11 of the sentence unit search device 1 receives the words input by the user from the reception device 4 (step S404). The CPU 11 stores the received words as text data in the temporary storage area 14 in the order of reception (step S405). At this time, a sentence identification number may be added to each piece of text data when it is stored.
•	the CPU 11 performs morphological analysis and syntactic analysis on the stored text data (step S406), and stores the words extracted by the analysis in the temporary storage area 14 (step S407). At this time, the CPU 11 collates each word against the words stored in the list and stores it by the identification number in the list.
•	by step S407, the temporary storage area 14 of the sentence unit search device 1 comes to store the words that have appeared or been referred to at least once in the series of input words (utterances).
•	the word extraction in step S407 need not necessarily be performed; in that case, the feature pattern specification process described later is performed on all the words stored in the list.
•	the CPU 11 identifies a feature pattern based on the text data received and stored in the past and on the results of the morphological analysis and syntactic analysis in step S406 (step S408). The CPU 11 substitutes the feature quantities of the identified feature pattern into the regression equation for calculating the reference probability, obtained in advance by regression analysis on spoken language, and calculates a reference probability for each word (step S409). The CPU 11 determines whether or not the reference probabilities have been calculated for all the words stored in the temporary storage area 14 (step S410). If the CPU 11 determines that the reference probabilities have not been calculated for all the stored words (S410: NO), the process returns to step S408, and the feature pattern specification and reference probability calculation are performed for the other words.
•	when the reference probabilities have been calculated and stored in the temporary storage area 14, the words are narrowed down to those having a reference probability of a predetermined value or more (step S411). This removes words with extremely low reference probabilities and reduces the load of the subsequent calculations on the CPU 11 itself.
•	the CPU 11 performs the following search processing based on the narrowed-down words and their reference probabilities, that is, based on the pairs of words and word reference probabilities that quantitatively represent the group of meanings in the flow following the previously accepted words.
•	the following search processing is an example of a process that compares the weighted word group obtained for the received words with the weighted word groups stored in advance for each sentence, determines whether or not the words and the sentences have similar meanings based on whether the weight value distributions of the multiple words in each weighted word group are similar, and extracts similar sentences.
•	the CPU 11 reads, from the database in the storage means 13 or the document storage means 2, the pairs of words and word reference probabilities stored in association with each sentence (hereinafter, weighted word groups) (step S412).
•	at this time, so that the CPU 11 can narrow down its reading to somewhat similar weighted word groups, the CPU 11 determines to which group the weighted word group associated with the accepted words, obtained by the processing up to step S411, belongs, in the same manner as for the weighted word groups stored in the database. The CPU 11 then reads from the database the weighted word groups of the group to which the weighted word group associated with the received words belongs. As a result, comparison with weighted word groups that are not at all similar can be avoided, and somewhat similar weighted word groups can be narrowed down and extracted.
  • the CPU 11 extracts a weighted word group including the same words as the weighted word group of the received word from the weighted word group read out in step S412 (step S413).
•	the CPU 11 calculates, for each word shared with the extracted weighted word group, the difference in the reference probabilities (step S414).
•	the CPU 11 assigns similarities to the extracted weighted word groups, higher for a larger number of identical words and a smaller difference in the reference probabilities of those words (step S415), and reads out the sentences associated with the extracted weighted word groups from the document data of the document set (step S416).
•	at this time, the CPU 11 may read out only the sentences corresponding to weighted word groups having a similarity equal to or greater than a predetermined value.
  • the CPU 11 sorts the extracted sentences by similarity (step S417).
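•	Steps S413 to S417 can be sketched as follows (the similarity score is an assumption chosen to rank a larger number of identical words and smaller reference probability differences higher, as described above):

```python
def similarity(query: dict[int, float], stored: dict[int, float]) -> float:
    """More shared words and smaller probability differences -> higher score."""
    shared = set(query) & set(stored)
    if not shared:
        return 0.0
    diff = sum(abs(query[w] - stored[w]) for w in shared)
    return len(shared) / (1.0 + diff)

query = {9714: 0.25, 9716: 0.10}          # weighted word group of the words
candidates = {                            # weighted word groups per sentence
    "sentence A": {9714: 0.238, 9716: 0.159},
    "sentence B": {9714: 0.02},
}
ranked = sorted(candidates.items(),
                key=lambda kv: similarity(query, kv[1]), reverse=True)
for name, group in ranked:
    print(name, round(similarity(query, group), 3))
```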
  • the CPU 11 transmits text data representing each sentence as text data of the search result to the accepting device 4 via the communication means 15 (step S418).
•	the CPU 41 of the accepting device 4 receives the text data of the search results via the communication means 48 (step S419), displays the received text data on a monitor or the like via the display means 46 (step S420), and ends the process.
•	in this way, the CPU 41 of the accepting device 4 transmits text data or speech data separated into single sentences to the sentence unit searching device 1 each time an input of words from the user is detected.
•	each time it receives text data, or voice data and the information transmitted together with the voice data, from the reception device 4, the CPU 11 of the sentence unit search device 1 extracts the words and calculates a reference probability for each word, and creates, for the words received from the user, information representing a group of meanings reflecting the flow from the preceding words, that is, a weighted word group, as a search request.
  • the CPU 11 of the sentence unit search device 1 extracts sentence units from the stored document data based on the search request (weighted word group) created for the accepted words, and sends the text data as the search results.
•	the CPU 41 of the accepting device 4 in the first embodiment displays the text data of the search results on the monitor or the like each time they are received. Therefore, every time words uttered by the user are input, the reception device 4 displays, as search results, text data whose meaning is similar to those words.
•	the receiving device 4 does not necessarily have to be configured to transmit text data each time words uttered by the user are input and to receive and display the search results. For example, a configuration may be used in which text data or voice data corresponding to a plurality of words input during a predetermined period is transmitted to the sentence unit search device 1, and the search results corresponding to the plurality of words are received and displayed.
•	FIG. 18 is an explanatory diagram showing an example of the feature patterns identified by the CPU 11 of the sentence unit searching device 1 according to the first embodiment for text data received from the receiving device 4. The sentence units s1, s2, and s3 in FIG. 18 are indicated by the received text data.
•	regression analysis has been performed in advance on the document data stored in the document storage means 2, so a regression equation is available with which, once a feature pattern is specified, the reference probability can be calculated by substituting the feature quantities. Therefore, the CPU 11 of the sentence unit search device 1 can calculate the reference probability for "Snoopy" in the sentence s based on the feature quantities dist, gram, and chain of the identified feature pattern. Further, the CPU 11 of the sentence unit search device 1 calculates the reference probabilities for the sentence s including the words that appeared or were referred to in the past, and obtains the words and their reference probabilities.
•	based on the obtained words and reference probabilities, the CPU 11 of the sentence unit search device 1 directly extracts, from the sentence units stored in advance in the document storage means 2 with the salience attribute, the sentence units in which the reference probabilities of the same words are equal to or greater than given values. The CPU 11 of the sentence unit search device 1 transmits text data indicating the extracted sentences to the accepting device 4 via the communication means 15.
•	in this way, the group of meanings of the words represented by the received text data can be expressed by the words and the reference probability (weight value) of each word. Since the words representing a group of meanings and their reference probabilities are stored for each sentence, sentences whose meanings are similar can be searched directly, based on whether or not the extracted words have similar reference probabilities.
•	in Embodiment 2, the pair of the extracted words and the reference probability calculated for each word (the weighted word group) is treated as a salience vector. Furthermore, the pair of the words obtained for the accepted words and the reference probability calculated for each word (the weighted word group) is also treated as a salience vector. Then, at the stage of the search process, instead of comparing the weight value distributions of the multiple words in the weighted word group of the accepted words and in the weighted word groups previously associated with each sentence as shown in the first embodiment, each weighted word group is represented by a salience vector, and whether or not the similarity condition is satisfied is determined by the shortness of the distance between the salience vectors.
•	as in the first embodiment, the information that quantitatively represents a group of meanings for each sentence is expressed as the group of words that the user is paying attention to when using the sentence (speaking, writing, listening, or reading), together with a value (word weight value) that quantitatively indicates the degree to which the user pays attention to each word, that is, its salience.
•	as the quantitative weight value of salience, the reference probability, which indicates the probability that the word will appear or be referenced in subsequent sentences, is used.
•	the reference probability is calculated using the regression equation containing the regression coefficients obtained by the regression analysis on the samples of the document data stored in the document storage means 2, as in "3-1. Regression model learning" of the first embodiment.
•	the CPU 11 of the sentence unit search apparatus 1 specifies the feature quantities dist, gram, and chain for each extracted word and, using the regression equation containing the regression coefficients obtained by the regression analysis, can calculate the reference probability of each word. A weighted word group is obtained by assigning the reference probability of each word as that word's weight value.
•	in the second embodiment, the weighted word group that represents a group of meanings for each sentence is treated as a vector in which each word is one dimension and the reference probability calculated for each word is the element of the dimension component corresponding to that word. That is, the group of meanings of a sentence in the document data stored in the document storage means 2 can be represented by a vector in the multidimensional space whose dimensions are the words extracted from the document data stored in the document storage means 2 and stored in the list shown in FIG. 6.
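•	As a sketch, such a salience vector can be held sparsely, with the word identification number as the dimension and the reference probability as the element (illustrative only; the full space over the list of FIG. 6 would have 31245 dimensions):

```python
# Sparse salience vector: word identification number -> reference probability.
salience_vector = {9714: 0.238, 9716: 0.159}

def dense(vec: dict[int, float], dims: list[int]) -> list[float]:
    """Expand a sparse salience vector over the given dimensions (word ids)."""
    return [vec.get(d, 0.0) for d in dims]

print(dense(salience_vector, [9714, 9716, 120]))  # [0.238, 0.159, 0.0]
```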
•	the document data that the CPU 11 of the sentence unit search apparatus 1 stores in the document storage means 2 in the second embodiment, with the results of calculating the reference probabilities added, is the same as the document data shown in the explanatory diagram of FIG. 11 of the first embodiment. That is, the document data stored in the document storage means 2 stores the dimension numbers and the reference probability values that are the elements of the corresponding dimension components.
•	the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 according to the second embodiment calculates the word reference probabilities for each sentence of the tagged document data stored in the document storage means 2 and stores them in the database in association with each sentence is the same as in the first embodiment, so its explanation is omitted.
Next, the processing by which the CPU 11 of the sentence unit search apparatus 1, upon receiving text data indicating words accepted by the accepting device 4, searches for sentence units in the documents stored in the document storage unit 2 will be described. The CPU 11 likewise represents the group of contextual meanings of the accepted words, for the text data indicating them, as a manifestation vector indicating a direction in the multidimensional word space.

Concretely, for the text data received from the accepting device 4, the CPU 11 of the sentence unit search device 1 specifies, for each of the 31,245 words stored in the list, the feature pattern represented by the feature quantities dist, gram, and chain. For a word that has not appeared in the series of text data received so far, the specification of the feature pattern is omitted and the corresponding dimension component element is set to 0. The reference probabilities that are the elements of the dimension components can then be calculated based on the regression equation. Therefore, each time text data is received, the CPU 11 of the sentence unit search device 1 can calculate a manifestation vector representing the cohesion of meaning in the context of the words indicated by the received text data.

The CPU 11 of the sentence unit search device 1 directly calculates the distance between the manifestation vector calculated for the accepted words and the manifestation vectors, stored in the document storage means 2, of the sentence units to which the salience attribute was added in advance, and extracts the sentence units whose distance is short. Sentence units whose meanings have a similar direction can thus be searched for in the 31,245-dimensional space in which each word of FIG. 6 is one dimension. The CPU 11 of the sentence unit search device 1 transmits text data indicating the extracted sentence units to the accepting device 4 via the communication means 15. If a computer capable of handling vector operations is used, the cohesion of meaning of each sentence unit can be computed on directly as a manifestation vector.
FIG. 19 is a flowchart showing the processing procedure of the search processing of the sentence unit search device 1 and the accepting device 4 in the second embodiment. The same reference numerals are used for the steps that are the same as in the processing procedures of the search processing shown in the flowcharts of FIGS. 15, 16, and 17 in the first embodiment, and their detailed description is omitted.
First, the CPU 11 of the sentence unit search device 1 calculates the reference probabilities and narrows down all the words stored in the temporary storage area 14 to those for which a reference probability equal to or greater than a predetermined value has been calculated (step S411). The manifestation vector of the accepted words is then calculated based on each narrowed-down word and its calculated reference probability (step S501). In this way, a manifestation vector that quantitatively represents the cohesion of meaning in the flow from the previously accepted words can be generated as the search request for the accepted words. The following processing is an example of processing that compares the manifestation vector obtained for the accepted words with the manifestation vectors of the sentence units stored in advance and determines whether the distributions of the weight values of the words represented by the manifestation vectors are similar.
The CPU 11 reads the weighted word groups stored in the database, that is, the manifestation vectors (step S502). At this time, for the manifestation vector associated with the accepted words obtained by the processing up to step S411, the CPU 11 determines, in the same manner as described above, to which group among the manifestation vectors stored in the database it belongs. The CPU 11 then reads from the database the manifestation vectors of the group to which the manifestation vector associated with the accepted words belongs. As a result, it is possible to narrow down and extract the manifestation vectors having a similar distribution of weight values over the words.
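The grouping scheme itself is the one carried over from the first embodiment and is not detailed here, but the narrowing idea can be sketched as follows, with a deliberately simple grouping key assumed purely for illustration (the word with the largest weight):

```python
def group_key(vec: dict) -> str:
    """Assumed grouping rule for illustration: the dominant word."""
    return max(vec, key=vec.get) if vec else ""

def narrow_candidates(query_vec: dict, stored: list[dict]) -> list[dict]:
    """Read only the manifestation vectors in the query's group (step S502),
    so the distance calculation of step S503 runs over fewer candidates."""
    key = group_key(query_vec)
    return [v for v in stored if group_key(v) == key]
```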
The CPU 11 calculates the distance between the manifestation vector associated with the accepted words and each read manifestation vector (step S503). The CPU 11 narrows down the read manifestation vectors to those whose calculated distance is less than a predetermined value (step S504), and reads the sentence units stored in association with the narrowed-down manifestation vectors (step S505). The CPU 11 assigns similarities to the read sentence units in order of increasing calculated distance (step S506). By the processing from step S501 to step S506 by the CPU 11 of the sentence unit search apparatus 1 in the second embodiment, sentence units whose contextual meaning is similar to the accepted words are extracted. The processing of step S417 for the extracted sentence units is the same as in the first embodiment.
The calculation in step S503 of the distance between the manifestation vector associated with the accepted words and each read manifestation vector in the above-described processing procedure is concretely performed as follows. Representing the manifestation vector associated with the accepted words u as v(u) and the read manifestation vector as v(s), the CPU 11 calculates the cosine as shown in the following equation (5):

  cos(v(u), v(s)) = (v(u) · v(s)) / (|v(u)| |v(s)|)   (5)

In step S506, the CPU 11 assigns similarities in descending order of the calculated cosine value.
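A minimal sketch of this distance computation follows, with the manifestation vectors held as sparse word-to-weight mappings rather than dense 31,245-dimensional arrays; the sparse representation is an implementation choice, not something the embodiment prescribes:

```python
import math

def cosine(v_u: dict, v_s: dict) -> float:
    """Cosine between two manifestation vectors, as in equation (5).
    Missing words are implicitly zero, so only shared words contribute
    to the dot product."""
    dot = sum(w * v_s.get(word, 0.0) for word, w in v_u.items())
    norm_u = math.sqrt(sum(w * w for w in v_u.values()))
    norm_s = math.sqrt(sum(w * w for w in v_s.values()))
    if norm_u == 0.0 or norm_s == 0.0:
        return 0.0
    return dot / (norm_u * norm_s)

# Sentence units are then ranked in descending order of this value (step S506).
v_u = {"America Village": 0.6, "Osaka": 0.3}
v_s = {"Osaka": 0.5, "autumn": 0.1}
print(cosine(v_u, v_s))
```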
In the third embodiment, the weight value representing the manifestation of each word is recalculated taking into account the associations from other words that are deeply related to it. Here, an association refers to the following: even when a word in the weighted word group associated with a sentence unit does not appear in that sentence unit or in the preceding sentence units, if another word deeply related to it is highly manifest there, the word itself is also attracting attention in that sentence unit. Accordingly, a word that readily attracts attention at the same time as a given word is attracting attention is treated as a related word, and the influence of the manifestation of closely related words is reflected in the weight value representing the manifestation of each word.
FIG. 20 is an explanatory diagram showing an overview of how the manifestation of one word influences that of closely related words in the search method of the present invention in the third embodiment. The explanatory diagram of FIG. 20 represents an example of a conversation between users, the conversation being a series of utterances u1, u2, u3, .... For a word that has not itself been referred to in the recent utterances, the weight value calculated from the reference probability alone may have dropped; however, when a word closely related to it is highly manifest in the conversation, the weight value of the word should have a high value.
Therefore, in the third embodiment, the weight value representing the manifestation of each word associated with each sentence unit or with the accepted words is recalculated in consideration of the manifestation of its related words. In order to recalculate the reference probabilities into weight values that take the manifestation of related words into account, the sentence unit search device 1 first needs to hold information representing how deeply the words are related to one another. The influence of the degree of association, which represents the depth of the relation, is then reflected in the reference probability of each word calculated for each sentence unit. Specifically, in terms of the above example, the degree of association of "America Village" with "Osaka" is first calculated quantitatively. Next, the weight value representing the manifestation of "Osaka" in each sentence unit is recalculated and stored by reflecting the influence of that degree of association on the reference probability of "America Village".
To this end, the sentence unit search device 1 creates, for each word, a weighted related word group in which the degree of association of every other word to that word is given as a weight value. The weighted related word groups are created from the weighted word groups stored in association with each sentence unit by the processing of "3-3. Quantification of manifestation per sentence unit", that is, from the combinations of words and their reference probabilities, or the manifestation vectors. The sentence unit search device 1 creates and stores a weighted related word group for every word extracted from the entire document set. Then, for the weighted word group stored in association with each sentence unit, that is, the combination of each word and its reference probability or the manifestation vector, the influence of the reference probabilities of the words closely related to each word is reflected in that word's reference probability using the degrees of association, and the weight value of each word is recalculated and stored. In the search processing, the sentence unit search apparatus 1 similarly recalculates, using the degrees of association, the weight value of each word in the weighted word group associated with the accepted words, that is, the combination of words and reference probabilities or the manifestation vector. The sentence unit search device 1 then performs the search processing based on the words corresponding to the accepted words and the weight values recalculated for each word.
The related word groups are created by the sentence unit search device 1 performing the following processing for every word extracted in the explanatory diagram shown in FIG. 6. First, from the weighted word groups stored in association with every sentence unit in "3-3. Quantification of manifestation per sentence unit", the sentence unit search device 1 extracts the weighted word groups in which the reference probability of the target word is equal to or greater than a predetermined value. This is because, as described above, a related word is a word that is likely to be attended to at the same time as the target word, so the sentence units in which the target word is attracting attention are singled out. Next, the sentence unit search device 1 integrates the extracted weighted word groups, in which the reference probability of the target word is equal to or greater than the predetermined value. Specifically, the reference probability of each word in each weighted word group is weighted by the reference probability of the target word in that weighted word group, and the reference probabilities of each word are averaged. The reason for weighting by the reference probability of the target word is to give greater influence to the reference probabilities of the words in weighted word groups in which the target word has a higher reference probability.
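Stated as code, the integration could be sketched as follows: collect the weighted word groups in which the target word's reference probability reaches the threshold, weight each group by that probability, sum per word, and normalize. This is a minimal sketch of the procedure just described; the threshold value 0.2 and the L2 normalization match the example given later in this embodiment.

```python
import math

def related_word_group(target: str, groups: list[dict], threshold: float = 0.2) -> dict:
    """Build the weighted related word group for `target` from the weighted
    word groups stored for every sentence unit."""
    # Keep only sentence units in which the target word is salient (step S605).
    selected = [g for g in groups if g.get(target, 0.0) >= threshold]
    # Sum each word's reference probability, weighted by the target word's
    # reference probability in that group (step S609).
    summed: dict = {}
    for g in selected:
        w_target = g[target]
        for word, p in g.items():
            summed[word] = summed.get(word, 0.0) + w_target * p
    # Normalize so the squares of the weights sum to 1 (step S610).
    norm = math.sqrt(sum(v * v for v in summed.values()))
    return {word: v / norm for word, v in summed.items()} if norm else summed

gw1 = {"America Village": 0.6, "Osaka": 0.4, "autumn": 0.0}
gw2 = {"America Village": 0.3, "Osaka": 0.5, "shopping": 0.2}
print(related_word_group("America Village", [gw1, gw2]))
```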
FIG. 21 and FIG. 22 are flowcharts showing the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 according to the third embodiment creates the related word groups. The processing shown in the flowcharts of FIG. 21 and FIG. 22 corresponds to the process of extracting the weighted word groups in which the weight value of one word is equal to or greater than a predetermined value, the process of integrating the weight values of each word of the extracted word groups as degrees of association, the process of creating the related word group in which a degree of association is assigned to each word and storing it in association with the one word, and the process of executing these steps for every word.
The CPU 11 of the sentence unit search device 1 selects one word from the list stored in the storage means 13 (step S601). Next, the CPU 11 acquires tagged document data from the document storage means 2 via the document set connection means 16 (step S602). The CPU 11 identifies the tags <su> added to the acquired document data by character-string analysis and reads out a sentence unit (step S603). The CPU 11 reads out the salience attribute stored in <su> (step S604), and determines whether or not, in the set of words and word reference probabilities (the weighted word group) stored in the salience attribute, the reference probability of the word selected in step S601 is equal to or greater than the predetermined value (step S605). If the CPU 11 determines that the reference probability is equal to or greater than the predetermined value (S605: YES), the CPU 11 stores the weighted word group read together with the salience attribute in step S604 in the temporary storage area 14 (step S606). The CPU 11 then determines whether or not the processing from step S604 to step S606 has been executed for all the sentence units of the document data acquired in step S602 (step S607). If the CPU 11 determines that the processing has not been executed for all the sentence units (S607: NO), the CPU 11 returns the processing to step S603, reads the next sentence unit (S603), and executes the processing from step S604 to step S606.
When the processing has been executed for all the sentence units (S607: YES), the CPU 11 determines whether or not the weighted word groups in which the reference probability of the selected word is equal to or greater than the predetermined value have been extracted from all the document data (step S608). If the CPU 11 determines that they have not yet been extracted from all the document data (S608: NO), the CPU 11 returns the processing to step S602, acquires the next document data (S602), and executes the processing from step S603 to step S607. If the CPU 11 determines in step S608 that the weighted word groups in which the reference probability of the selected word is equal to or greater than the predetermined value have been extracted from all the document data (S608: YES), the CPU 11 integrates the set of weighted word groups extracted by the processing of step S606 and stored in the temporary storage area 14, by calculating, for each word, the sum of the weight values weighted by the reference probability of the selected word (step S609).
The CPU 11 normalizes the summed weighted word group created in step S609, that is, the weight value of each word of the summed weighted word group (step S610). The CPU 11 stores the weighted word group thus normalized, with each weight value as a degree of association, as the related word group of the word selected in step S601, in the storage means 13 or, via the document set connection means 16, in the document storage means 2 (step S611). The CPU 11 of the sentence unit search device 1 then determines whether or not it has created and stored related word groups for all the words in the list stored in the storage means 13 (step S612). If the CPU 11 determines that it has not yet created and stored related word groups for all the words (S612: NO), the CPU 11 returns the processing to step S601, selects the next word (S601), and executes the processing from step S602 to step S611 for the selected word.
Note that in step S605, rather than simply determining whether the raw reference probability is equal to or greater than the predetermined value, the CPU 11 of the sentence unit search device 1 may compare a normalized value with the predetermined value as follows. The CPU 11 of the sentence unit search device 1 normalizes the reference probabilities of the words associated with a sentence unit by dividing each by the square root of the sum of the squares of all the reference probabilities, so that the sum of the squares of the normalized reference probabilities is "1". Likewise, in step S610, the CPU 11 performs the normalization by dividing each weight value by the square root of the sum of the squares of all the weight values.
Next, an example is given of the related word group created when the CPU 11 of the sentence unit search apparatus 1 performs the processing shown in the flowcharts of FIGS. 21 and 22 for one word. FIG. 23 is an explanatory diagram showing an example of the weighted word groups at each stage of the processing when a related word group is created by the CPU 11 of the sentence unit search apparatus 1 according to the third embodiment. In the example of FIG. 23, the CPU 11 of the sentence unit search device 1 selects the word "America Village" and extracts the weighted word groups in which the reference probability of "America Village" is equal to or greater than the predetermined value (0.2).
FIG. 23(a) shows the weighted word groups GW1, GW2, GW3 extracted by the processing of the CPU 11 in step S605 shown in the flowcharts of FIGS. 21 and 22 and stored in the temporary storage area 14. FIG. 23(b) shows the word groups GW1', GW2', GW3' after each has been weighted, and FIG. 23(c) shows the weighted word group GW'' obtained by weighting and summing them through the processing of the CPU 11 in step S609.
In FIG. 23(a), the weighted word groups GW1, GW2, GW3, in which the weight value (reference probability) of the one word "America Village" is equal to or greater than the predetermined value 0.2, have been extracted. In FIG. 23(b), each weight value is multiplied by the weight value (reference probability) of "America Village" in its own weighted word group. The weight values of each word in the resulting word groups GW1', GW2', GW3' are obtained as follows. For example, since the weight value (reference probability) of "America Village" in the weighted word group GW1 is 0.6, each weight value of GW1 is multiplied by 0.6, giving the word group GW1' (autumn: 0 (0.6 × 0), America Village: 0.36 (0.6 × 0.6), ..., Okumaza: 0). As shown in FIG. 23(b), the weight values of each word, weighted in this way by the weight value (reference probability) of the one word "America Village", are then summed word by word: the weight value of each word in the word group GW'' shown in FIG. 23(c) is the sum over the word groups GW1', GW2', GW3' shown in FIG. 23(b). Finally, the CPU 11 of the sentence unit search device 1 squares the weight value of each word, calculates the square root of the sum of the squared values, divides the weight value of each word by that square root, and thereby normalizes the weight values of each word in the weighted word group GW''.
The weighted word group GW'' integrated by weighting and summing is a multidimensional vector in which each word is one dimension and the weight value of each word is the element of the corresponding dimension. The multidimensional vector may be normalized by dividing each weight value (element) by the norm of the multidimensional vector; the norm need not necessarily be the Euclidean norm. The weighted word group obtained as the result of summing and normalizing in this way is created by the CPU 11 of the sentence unit search device 1 as the related word group of "America Village".
The example shown below is the related word group of the word "America Village", with the words listed in descending order of weight value. The related word group created for a word w_i, in which each weight value is the degree of association from the word w_i to each of the words w_1, ..., w_N, is denoted bw_i = (w_1: b_{i,1}, w_2: b_{i,2}, ..., w_N: b_{i,N}).
The CPU 11 of the sentence unit search device 1 repeats the above-described processing for all the words shown in the explanatory diagram of FIG. 6 to create a related word group for each word, and stores them in the document storage means 2 or in the storage means 13 of the sentence unit search device 1. By creating, for every word appearing in the document set, a related word group in which the degrees of association are quantitatively calculated in this way, the influence of related words can be reflected in the weighted word groups that are created and stored to represent the cohesion of meaning of each sentence unit. Next, the degrees of association of each word in the created related word groups are reflected in the weighted word group stored for each sentence unit, that is, in the set of words and their reference probabilities or the manifestation vector. Specifically, the sentence unit search device 1 reads the reference probability of each word that has already been calculated and stored, and recalculates and stores, as the new weight value of each word, the sum of the reference probabilities of the words multiplied by their degrees of association to that word.
FIG. 24 is a flowchart showing the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 in Embodiment 3 recalculates the weight value of each word in the weighted word group stored in association with each sentence unit. The processing shown in the flowchart of FIG. 24 corresponds to the process of reassigning, using the degrees of association, the weight value of each word of the weighted word group associated with each sentence unit.

The CPU 11 of the sentence unit search device 1 acquires tagged document data from the document storage unit 2 via the document set connection unit 16 (step S71). The CPU 11 identifies the tags <su> added to the acquired document data by character-string analysis and reads out a sentence unit (step S72). The CPU 11 reads the salience attribute stored in <su> (step S73), and recalculates each of the reference probabilities in the word and word-reference-probability pairs (the weighted word group) stored in the salience attribute into weight values that take the associations into account, using the related word groups (step S74). The CPU 11 re-stores the weighted word group (manifestation vector), consisting of each word and the weight values recalculated in step S74, with the salience attribute added (step S75).

The CPU 11 determines whether or not the sentence unit read in step S72 is at the end of the document data (step S76). Whether the current sentence unit is at the end of the acquired document data can be determined by whether or not another <su> tag follows the <su></su> pair enclosing the current sentence unit; if none follows, it can be determined to be the end. If the CPU 11 determines that it is not the end of the document data (S76: NO), the CPU 11 returns the processing to step S72 and continues the processing for the next sentence unit. On the other hand, if the CPU 11 determines that it is the end of the document data (S76: YES), the CPU 11 determines whether or not the processing of recalculating the weight values of each word in the weighted word groups and storing them in association with the salience attributes has been completed for all the document data (step S77).
The CPU 11 of the sentence unit search apparatus 1 realizes the recalculation of the weight value of each word in step S74 by performing the following processing. FIG. 25 is a flowchart showing the details of the processing procedure by which the CPU 11 of the sentence unit search device 1 in Embodiment 3 recalculates the weight value of each word in the weighted word group stored in association with each sentence unit. The processing shown in the flowchart of FIG. 25 corresponds to the process of multiplying the weight values of the weighted word group by the degrees of association of each word, and the process of reassigning the weight value of each word based on the multiplied weight values.
The CPU 11 of the sentence unit search device 1 reads each word of the weighted word group stored in association with the salience attribute read in step S74 of the flowchart of FIG. 24, together with the reference probability of each word, and stores them in the temporary storage area 14 (step S81). The CPU 11 selects one of the words (step S82), and performs the following processing for the weight value of the selected word. The CPU 11 reads the related word groups, in which the degree of association of each word is given, stored in the storage means 13 or the document storage means 2 (step S83). The CPU 11 acquires, from the related word group of each read word, the degree of association from each word to the selected word (step S84). The CPU 11 multiplies the acquired degree of association from each word to the selected word by the reference probability of that word stored in the temporary storage area 14, and calculates the sum (step S85). The CPU 11 then determines whether or not the weight value has been recalculated for all the words stored in the temporary storage area 14 in step S81 (step S86). If the CPU 11 determines that the weight value has not been recalculated for all the words (S86: NO), the CPU 11 returns the processing to step S82, moves to the next word, and executes the processing of recalculating the weight value from step S82 to step S85. If the CPU 11 determines that the weight value has been recalculated for every word (S86: YES), the CPU 11 returns the processing to step S75 of the flowchart of FIG. 24.
The processing by which the CPU 11 of the sentence unit search device 1 recalculates the weight values, shown in the flowchart of FIG. 25 as step S74 of the flowchart of FIG. 24, may also be executed within the processing of the first embodiment that calculates the reference probabilities and stores them as the weight values representing the manifestation in each sentence unit. For example, the configuration may be such that the processing of step S74 shown in the flowchart of FIG. 25 is executed between the processing of step S306 and step S307 of the processing procedure of the first embodiment.
By the above processing, the CPU 11 of the sentence unit search device 1 recalculates the reference probability calculated for each word into a weight value that reflects the associations. For example, the sentence unit search device 1 calculates the weight value representing the manifestation of "Osaka" in a sentence unit as follows. Assume that in the related word group created for "America Village" the degree of association to "Osaka" is "0.3", and that the words stored in association with a sentence unit include "America Village" with a reference probability of 0.4 but do not include "Osaka". The CPU 11 of the sentence unit search device 1 multiplies the reference probability 0.4 of "America Village" by the degree of association 0.3 from "America Village" to "Osaka", so that the weight value of "Osaka" in that sentence unit is recalculated to "0.12" instead of "0".
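The recalculation loop of steps S81 to S86 and this worked example can be put together in a short sketch. Assuming, for illustration, that each word's degree of association to itself is 1 (consistent with the unit-length basis vectors described in the following paragraphs), the function reproduces the value 0.12 for "Osaka":

```python
def recalc_weights(weighted_group: dict, related: dict) -> dict:
    """Recalculate weights, reflecting associations (steps S81-S86).
    related[w] maps w to its related word group: the degrees of
    association from w to other words."""
    # Candidate words: those in the group plus those associated with them,
    # so that a word like "Osaka" can gain weight from "America Village"
    # even when its own reference probability is 0.
    candidates = set(weighted_group)
    for word in weighted_group:
        candidates |= set(related.get(word, {}))
    recalculated = {}
    for target in candidates:                       # step S82: select one word
        total = 0.0
        for word, ref_prob in weighted_group.items():
            # degree of association from `word` to `target` (step S84),
            # multiplied by `word`'s reference probability and summed (step S85)
            total += related.get(word, {}).get(target, 0.0) * ref_prob
        recalculated[target] = total
    return recalculated

# Worked example from the text: association "America Village" -> "Osaka" = 0.3,
# reference probability of "America Village" = 0.4, "Osaka" absent.
related = {"America Village": {"America Village": 1.0, "Osaka": 0.3}}
print(recalc_weights({"America Village": 0.4}, related))  # Osaka -> 0.12
```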
When the weight value representing the manifestation, with contextual associations taken into account, of a word w_k in each sentence unit s_j is written salience'(w_k | pre(s_j)), the sentence unit search apparatus 1 recalculates the weight value of each word as shown in the following equation (6):

  salience'(w_k | pre(s_j)) = Σ_{i=1}^{N} b_{i,k} · salience(w_i | pre(s_j))   (6)

where b_{i,k} is the degree of association from the word w_i to the word w_k. Writing the weighted word group, that is, the pairs of words and word reference probabilities, as the manifestation vector v(s_j) whose k-th element is salience(w_k | pre(s_j)), the manifestation vector V(s_j) after the associations are reflected is expressed as in equation (7):

  V(s_j) = B v(s_j)   (7)

where B is the transformation matrix whose elements are the degrees of association b_{i,k}. Equation (7) represents the principle by which the weight value of each word is calculated: each of bw_1, ..., bw_N is the relevance vector of the related word group for the words w_1, ..., w_N, and V(s_j) is the manifestation vector in the oblique coordinate system based on the relevance vectors bw_1, ..., bw_N. The manifestation vector V(s_j) taking the associations into account can thus be interpreted as the manifestation vector v(s_j), whose elements are the raw reference probabilities, rotated toward the axes of the related words. The oblique coordinate system based on the relevance vectors bw_1, ..., bw_N is a coordinate system in which each basis vector (a vector of size 1 in the direction of each word dimension) reflects the associations: the angle between the basis vectors of mutually highly related words is small, while words that are not related to each other remain orthogonal. Multiplying the transformation matrix whose elements are the b_{j,k} by the manifestation vector whose elements are the reference probabilities therefore reflects the associations in the weight value of every word.
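In matrix form, the same recalculation is a single multiplication, as the following short numpy sketch shows; the word order and the association values are illustrative:

```python
import numpy as np

words = ["America Village", "Osaka", "autumn"]
# B[k, i] = degree of association from word w_i to word w_k; the diagonal
# is 1 because each basis vector has size 1 in its own word dimension.
B = np.array([[1.0, 0.2, 0.0],
              [0.3, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
v = np.array([0.4, 0.0, 0.0])    # raw reference probabilities, v(s)
V = B @ v                        # equation (7): V(s) = B v(s)
print(dict(zip(words, V)))       # "Osaka" becomes 0.12 via the association
```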
FIG. 26 is an explanatory diagram showing an example of the contents of the weight values representing the manifestation of each word calculated by the CPU 11 of the sentence unit search apparatus 1 according to the third embodiment. The weight values of each word for the sentence units s1, s2 shown in FIG. 26(a) are the values before the related word groups are applied, and the weight values of each word for the sentence units s1, s2 shown in FIG. 26(b) are the values after the associations have been taken into account using the related word groups. The specific example shown in FIG. 26 uses sentence units extracted from a Japanese spoken-language corpus (http://www.kokken.go.jp/katsudo/kenkyujyo/corpus, CSJ/vol17/D03F0040).
Next, the CPU 11 of the sentence unit search apparatus 1 adds the associations with related words to the combination of words and word reference probabilities, or the manifestation vector, that is, the weighted word group, that quantitatively represents the meaning of the accepted words. Below, the processing is described by which the CPU 11 of the sentence unit search device 1 recalculates, taking the associations into account, the weight value of each word in the weighted word group associated with the accepted words, and executes a search based on the recalculated weight values.
FIG. 27 is a flowchart showing the processing procedure of the search processing of the sentence unit search device 1 and the accepting device 4 in the third embodiment. The same reference numerals are used for the steps that are the same as in the processing procedures of the search processing shown in the flowcharts of FIGS. 15, 16, and 17 in the first embodiment, and their detailed description is omitted. The processing of step S4001, surrounded by the two-dot chain line, differs from the processing procedures shown in the flowcharts of FIGS. 15, 16, and 17 in the first embodiment; that is, the difference is that step S4001 described below is added between step S411 and step S412.
After calculating the reference probabilities, the CPU 11 narrows down all the words stored in the temporary storage area 14 to those whose calculated reference probability is equal to or greater than the predetermined value (step S411), and recalculates the reference probabilities calculated in step S408 into weight values that reflect the associations (step S4001). In step S4001, the CPU 11 recalculates the weight values reflecting the associations in the same way as in the processing shown in the flowchart of FIG. 25: it selects one word at a time and calculates, for the selected word, the sum of the products of the degree of association from each word to the selected word and the reference probability of each word. The CPU 11 then compares the weighted word group with the associations added obtained in step S4001 with the weighted word groups with associations added that are stored in association with each sentence unit, and executes the processing of extracting similar sentence units. Since the subsequent processing for the weighted word groups with the associations added is the same as in the first embodiment, its detailed description is omitted.
As described above, the sentence unit search apparatus 1 can directly output, from among the sentence units separated from the document data stored in the document storage means 2, the sentence units whose cohesion of meaning, with associations taken into account using the related words, is judged to be similar to the accepted words. Therefore, by executing the sentence unit search method of the present invention, sentence units whose contextual meanings are similar, including associations, can be effectively extracted and directly output. When the CPU 11 of the sentence unit search device 1 associates a weighted word group with the accepted words and determines whether it is similar to the weighted word groups stored in advance for each sentence unit, the determination need not always be based, as in the processing procedure shown in the flowchart of FIG. 27, on whether the weighted word groups include the same words and on calculating the differences between the weight values assigned to the same words, with smaller calculated differences meaning greater similarity. The case where the CPU 11 of the sentence unit search apparatus 1 extracts sentence units whose meaning is similar to the accepted words by expressing the meanings as manifestation vectors reflecting the relevance vectors and calculating the distance between them is described below.
FIG. 28 is a flowchart showing the processing procedure of the search processing of the sentence unit search device 1 and the accepting device 4 when the vector representation in the third embodiment is used. In the processing procedure shown in the flowchart of FIG. 28, the same reference numerals are used for the steps that are the same as in the processing procedures of the search processing shown in the flowcharts of FIGS. 15, 16, and 17 in the first embodiment and in the flowchart of FIG. 19 in the second embodiment, and their detailed description is omitted. The processing from step S501 up to step S506, surrounded by the alternate long and short dash line, is also carried out here. The processing of step S5001, surrounded by the two-dot chain line, differs from the processing procedure shown in the flowchart of FIG. 19 in the second embodiment; that is, the difference is that step S5001 described below is added between step S501 and step S502.
The CPU 11 of the sentence unit search device 1 recalculates the manifestation vector calculated in step S501 into a manifestation vector reflecting the associations of the related words (step S5001). The CPU 11 then compares the manifestation vector with the associations obtained in step S5001 with the manifestation vectors with associations that are stored in association with each sentence unit, and executes the processing of extracting similar sentence units. Since the processing of reading the manifestation vectors with the associations added and extracting similar sentence units is the same as in the second embodiment, its detailed description is omitted. The processing of step S5001, in which the CPU 11 recalculates the manifestation vector into one taking the associations with the related words into account, transforms (rotates) the manifestation vector calculated in step S501 using the group of relevance vectors (the matrix), as shown in equation (7). Specifically, the manifestation vector V(u) is calculated by applying the association transformation to the multidimensional vector v(u) whose elements are only the reference probabilities.
In the third embodiment, the calculation in step S503 of the distance between the manifestation vector associated with the accepted words and each read manifestation vector is concretely performed as follows. Representing the manifestation vector recalculated with the associations for the accepted words u as V(u), and the read manifestation vector with the associations added in advance as V(s), the CPU 11 calculates the cosine as shown in the following equation (8):

  cos(V(u), V(s)) = (V(u) · V(s)) / (|V(u)| |V(s)|)   (8)

In step S506, the CPU 11 assigns similarities in descending order of the calculated cosine value.
The manifestation vectors associated with each sentence unit and with the accepted words are handled in an oblique coordinate system in which the dimensions corresponding to the words are not orthogonal and the angle between the dimension directions of words with a high degree of association is small. For this reason, when the distances between vectors are compared to determine whether they are similar, vectors that have elements in the dimension directions of highly associated words are also judged to be similar. For example, when the manifestation of "Osaka" in the accepted words is low, a sentence unit s concerning "Osaka" is not judged to be similar to the accepted words. However, when the manifestation of "America Village" in the accepted words is high, the manifestation of "Osaka" is excited and increased, so the possibility that the sentence unit s is judged to be similar to the accepted words increases.
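As an illustration with hypothetical numbers, the following compares the cosine before and after the association transform for this example; the symmetric association value and the weights are assumptions for demonstration. The effect is that a query mentioning only "America Village" starts to match a sentence unit about "Osaka":

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

B = np.array([[1.0, 0.3],     # dims: [America Village, Osaka]
              [0.3, 1.0]])    # symmetric association assumed for illustration
v_u = np.array([0.8, 0.0])    # accepted words: only "America Village" manifest
v_s = np.array([0.0, 0.7])    # sentence unit s: only "Osaka" manifest
print(cos(v_u, v_s))          # 0.0 - no overlap without associations
print(cos(B @ v_u, B @ v_s))  # > 0 - equation (8) on the rotated vectors
```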
In Embodiments 1 to 3, the text data received as the search result is displayed on the monitor of the display means 46 provided in the accepting device 4; however, a configuration may also be adopted in which the received text data is converted into a speech signal and output via the speaker of the audio input/output means 47. In this way, the user can obtain, as search results, sentence units whose contextual meaning is similar, either from the words he or she inputs or from a conversation with another user. Moreover, since the accepted words may be spoken language, sentence units whose word manifestations are similar can be obtained directly as search results even for words that are omitted in the utterances, including words represented by zero pronouns.
In Embodiments 1 to 3, the sentence unit search apparatus 1 specifies and stores the information indicating the manifestation for each sentence unit. However, a configuration may also be adopted in which paragraphs enclosed by the tags <p></p> are used instead: the feature pattern is specified for each paragraph, the information indicating the manifestation is stored in the salience attribute, and the paragraph is output as the search result. The unit is not limited to a sentence or a paragraph and may be a phrase, as long as it is a unit that represents a certain cohesion of meaning. In the case of spoken language, the character string identifiable as one sentence can be very long. Document data consisting of spoken language may therefore be stored in advance separately from document data consisting of written language, and a configuration may be adopted in which the document storage means 2 stores the probability each time a word's feature pattern is specified and its reference probability is calculated.
In order to determine whether or not successively received words form one series, the CPU 11 of the sentence unit search device 1 can use information identifying the accepting device 4 that is the transmission source of the words, or information indicating that the accepting device 4 has detected the user's search start/end operation. The received words can also be stored in the document storage unit 2 in units corresponding to the pages of the document data stored in the document storage unit 2 in advance.
In Embodiments 1 to 3, the sentence unit search device 1 performs all of the processing: acquiring and tagging the document data, the regression analysis for obtaining the reference probabilities, and the processing performed when words are received. However, this processing may be divided between a sentence unit search device and a document storage device. In that configuration, the document storage device acquires the document data by Web crawling, adds tags to the text data by morphological analysis and syntactic analysis, and stores it; furthermore, the equation for calculating the reference probability is obtained in advance by regression analysis based on the document data stored in the document storage device, and the words and their reference probabilities are stored for each sentence unit using the obtained equation. The sentence unit search device then specifies the feature patterns when it receives text data converted from words, acquires from the document storage device the regression equation for calculating the reference probabilities, calculates the reference probabilities, and performs the search.
In Embodiments 1 to 3, the input of words, such as a character-string input or a speech input from the user, is converted into text data by the accepting device 4 and transmitted to the sentence unit search device 1. However, the sentence unit search apparatus 1 may itself be configured to include input/output means that accepts the user's character-string input operations and voice input means that accepts the user's speech input. FIG. 29 is a block diagram showing the configuration in the case where the sentence unit search method of the present invention is implemented by the sentence unit search apparatus 1 alone. In this case, in addition to the CPU 11, the internal bus 12, the storage means 13, the temporary storage area 14, the document set connection means 16, and the auxiliary storage means 17, the sentence unit search device 1 further includes operation means 145, such as a mouse or a keyboard, that accepts user operations, display means 146 such as a monitor, and voice input/output means 147 such as a microphone and a speaker.
In this configuration, the CPU 11 of the sentence unit search device 1 can detect the frequency, speech rate, and other characteristics of the speech input from the voice input means and specify the feature pattern of each word in the utterance. It is also possible to convert the speech into text data by speech recognition, specify the grammatical feature pattern of each word, and perform the search based on the text data.
In Embodiments 1 to 3, the accepting devices 4, 4, ... were configured merely as devices that cut the received character-string or speech words into certain lengths, convert them into digital data, and transmit them. However, the accepting devices 4, 4, ... may be configured so that the CPU 41, by executing the program stored in the storage means 43, performs natural language analysis such as morphological analysis and syntactic analysis, or phoneme analysis, on the accepted words. Furthermore, the CPU 41 of the accepting devices 4, 4, ... may calculate the weight values representing the manifestation of each word in the accepted words and transmit the calculated weighted word group to the sentence unit search device 1 as the search request.
The sentence unit search method according to the present invention can be applied, in combination with speech recognition of conversations between users, to applications in which a computer apparatus participates in a conversation between users and carries on the conversation. It can also be applied to applications that provide a conversation-linked advertisement presentation service in which advertisements are switched according to the flow of the conversation or chat context between users; to conference support services that present similar and related minutes from past minutes according to the flow of the context during a meeting; and further to writing support services that accept written text as words and present related information according to the flow of the context.

Abstract

A computer executing the sentence unit search method divides the document data of a document set into sentence units in advance. Information representing the cohesion of meaning that reflects the flow of context from the preceding sentence units, namely a weighted word group in which a weight value is given to each word of a sentence unit, is associated with each sentence unit, and the sentence units and the associated weighted word groups are stored. When the computer receives words, it obtains information representing the cohesion of meaning in the flow of the uttered conversation, namely a weighted word group in which a weight value is given to each word, associates it with the received words, extracts the sentence units whose cohesion of meaning is similar according to the weighted word group associated with the words, and outputs them as the search result. The weight value given to each word may be a value reflecting the influence of the weight values of the related words in the sentence unit according to the degree of relation between each related word and the word.

Description

Specification

Sentence unit search method, sentence unit search device, computer program, recording medium, and document storage device

Technical field

[0001] The present invention relates to a search method that searches a stored collection of document data based on words, such as text or speech, accepted from a user for searching. In particular, the present invention relates to a sentence unit search method capable of directly retrieving, from among the sentence units that are the units of cohesion of meaning in documents whose meaning changes dynamically with the flow of context, the sentence units whose meaning is similar to the accepted words; to a sentence unit search apparatus; to a computer program that causes a computer to function as the sentence unit search apparatus; to a computer-readable recording medium on which the computer program is recorded; and to a document storage apparatus.
Background art

[0002] Among the various services provided on the Internet, there are document search services that, based on keywords or sentences input by a user, search the documents published on the Internet for related documents and output them as a list.

[0003] A conventional document search service operates as follows. Documents published on the Internet are automatically collected and stored, and for each document the words appearing in it are stored together with their appearance probabilities within the document. When words such as keywords or sentences are accepted, documents are extracted from the stored document set with priorities assigned in descending order of the appearance probabilities of the words contained in the accepted keywords or sentences, and the sentences or paragraphs containing those words are output from the extracted documents.

[0004] A user of a document search service must think up the keywords related to the information he or she wants to find. Some recent document search services can accept a natural sentence as an input, morphologically analyze the input sentence, identify its keywords, and automatically create a search request.
[0005] Moreover, even when a document search service accepts natural-sentence input, it usually extracts the words contained in the input sentence and outputs the documents containing the extracted words as search results. The user therefore has to narrow the results down by further entering keywords related to the initial keywords, or words that change the sense of those keywords, in order to obtain the desired results. For example, with "president" alone it is unclear which country's president is meant, so the keyword must be supplemented to "president, America". Furthermore, depending on what the user wants to find out about the American president, information that makes the desired results easier to obtain, such as "president, America, origin" or "president, America, policy", must be considered.

[0006] Therefore, in order to actually obtain the search results the user has in mind, the user must think up combinations of keywords and try them repeatedly. For example, even when the user wants to know "what measures the American president takes when economic problems arise with other countries", the query "America, president, economy" yields a huge number of results from which the user must select documents. Suppose the user then narrows the search by adding the keyword "policy" and inputs "America, president, economy, policy". Even though "policy" is a broad, high-level concept, the narrowing is performed on the keyword "policy" itself, so documents that do discuss economic policy but in which the word "policy" itself appears infrequently may be missed. In this way it is difficult for the user to obtain the desired results by devising and trying keywords for the purpose of the search; each time additional information is entered, the content of the search results may drift away from the original purpose of the search.

[0007] Also, in the above example, what the user wants to know concerns economic policy, and moreover international policy. Even if the user's input is a natural sentence, a human reader can grasp which of the words "America, president, other countries, economy, problems, arise, case, measures" is the most important, but it is difficult to express this quantitatively as information handled by a device or computer. Consequently, a document that discusses "America's economic problems and the measures taken by the presidents of other countries", which contains all the keywords, may well be output.
[0008] Furthermore, when the document to be searched is very long, the search is based on the words appearing in the document treated as a single unit, even though the context changes dynamically within it. Thus, if there exists a document in which the history of the American presidents, the histories of the presidents of other countries, the economic systems of various countries, and measures against unemployment in each country are described in separate chapters, it will be output as a search result because it contains most of the search keywords. Even when those chapters are not contextually connected, the results of partially extracting the sentences or paragraphs containing the keywords are output. It therefore cannot be gauged whether the meaning of the extracted part, including the influence of the preceding context leading up to it, semantically matches the search intention in the user's mind.

[0009] On the other hand, there are cases where a keyword entered for a search does not appear frequently in the target document and yet is contained in it with an important contextual meaning. For example, the more central a word is to the topic, the more often it is expressed by a demonstrative pronoun or a zero pronoun. The sentences or paragraphs in which the entered keyword is expressed by a demonstrative pronoun or a zero pronoun may therefore be exactly the information the user wants to obtain as a search result. However, when priorities are assigned to the search results according to the actual appearance frequency, such passages are excluded from the candidates during narrowing because the keyword entered by the user appears only rarely, and they are not output as search results.

[0010] A technique has therefore been proposed in which the words in a document are extracted, and the part-of-speech information of the words, dependency information between words, and information explicitly identifying the words that stand in anaphoric relation to demonstrative pronouns or zero pronouns are added to the result of analyzing the document by morphological analysis and the like and stored; document search, question answering, and machine translation by a device or computer are then realized based on the stored information (Non-Patent Document 1).

[0011] Relations between words such as dependency and anaphora occur in natural sentences whose phrase order is complicated, so even though a human reader can determine the meaning, it is difficult to recognize them mechanically. In the technique described in Non-Patent Document 1, relations such as dependency and anaphora between words are therefore added to the document data by tags as information for each sentence or phrase and stored. In Japanese in particular there are many sentences in which the subject is omitted, so the subject must be supplemented for mechanical translation. The technique of Non-Patent Document 1 therefore adds complementary information, such as the subject or zero pronouns, for each sentence. This makes accurate machine translation possible by using documents to which this information has been added. Words omitted in a sentence, or represented by demonstrative pronouns or zero pronouns, can also be used in applied techniques such as calculating appearance frequencies for document search.

Non-Patent Document 1: Koiti Hasida, "Global Document Annotation (大域文書修飾)", Proceedings of the 11th Annual Conference of the Japanese Society for Artificial Intelligence, pp. 62-63 (1997)
発明の開示  Disclosure of the invention
発明が解決しょうとする課題  Problems to be solved by the invention
[0012] 文章を書く時、又は発話する時の、その各文又は各発話夫々におけるユーザの注 目対象 (重点対象)は、会話や文章の文脈の流れに従って動的に変化する。つまり、 会話や文章における単語への注目度合いを表す重みは、動的に変化する。よって、 会話や文章に関連する情報を検索するサービスを実現するためには、文脈に応じた 単語の重みの動的変化を追跡する必要がある。  [0012] When writing a sentence or when speaking, the user's attention object (priority object) in each sentence or each utterance dynamically changes according to the context or the context flow of the sentence. In other words, the weight representing the degree of attention to words in conversations and sentences dynamically changes. Therefore, in order to realize a service that retrieves information related to conversations and sentences, it is necessary to track dynamic changes in word weight according to the context.
[0013] し力しながら、従来の文書検索サービスでは、検索のために入力された単語の出現 頻度の高い文書を抽出し、抽出した文書から、当該単語を含む文又は段落を抽出し て出力するため、当該単語のその文又は段落の文脈で動的に変わる重みについて は考慮されずに検索される。したがって、出現頻度に基づく検索では、確かに検索の ために入力された単語を含んではいるものの、文脈上当該単語がユーザが考えるよ うに使用されていない場合があり、ユーザの検索目的を達成することができるとは限 らない。各単語の文脈上の意味における各文での重み、即ち文脈上注目されている か否かについては特定できない。したがって、入力したキーワードをユーザの考える 意味合 、通りに使用した文又は段落を出力することはできな 、。  However, in the conventional document search service, a document with a high frequency of appearance of words input for search is extracted, and a sentence or paragraph including the word is extracted and output from the extracted document. Therefore, the weights that change dynamically in the context of the sentence or paragraph of the word are searched without being considered. Therefore, in the search based on the appearance frequency, although the word input for the search is surely included, the word may not be used as the user thinks in context, thereby achieving the user's search purpose. It is not always possible. It is not possible to specify the weight of each sentence in the contextual meaning of each word, that is, whether or not it is noticed in context. Therefore, it is not possible to output the sentence or paragraph used according to the meaning of the keyword entered by the user.
[0014] The technique of Non-Patent Document 1 can automatically analyze information identifiable from grammar, such as part-of-speech information, and can add to a document information on the complementation of demonstrative pronouns or zero pronouns, on anaphora, and on dependency. With this added information a referenced noun can be counted toward appearance frequency, so the relationships between words within a sentence or paragraph can be analyzed. However, the degree of attention each word receives in a sentence or paragraph, that is, its salience, cannot be measured quantitatively.
[0015] The technique of Non-Patent Document 1 can be applied to question answering, in which a computer responds to a natural-language question while allowing for words omitted from the question. It does not, however, readily make it possible to compute the contextual meaning of a dialogue among several users as a quantitative value, or to generate and present, as a third party, utterances that follow the context of the users' dialogue.
[0016] Furthermore, a conventional document search service cannot perform a search that allows for words expressing background knowledge deeply related to the context when such words appear only infrequently in a document. It therefore cannot directly output sentences or paragraphs that evoke a word the searching user has in mind but that does not appear among the words input for the search.
[0017] The present invention was made in view of these circumstances. In the invention, a weighted word group, in which each word carries a weight value expressing its salience within a given sentence unit consisting of one or more sentences, is stored in association with each sentence unit; a weighted word group carrying weight values for received search words is likewise associated with those words, and sentence units whose associated weighted word groups are similar are extracted and output. An object of the invention is to provide a sentence unit search method that automatically generates, from received words, information expressing their meaning as shaped by the context of the preceding words in the user's mind, and that can directly retrieve, from among the sentence units of documents whose meaning changes dynamically with the flow of context, sentence units whose contextual units of meaning are similar to those expressed by the generated information; a sentence unit search device; a computer program that causes a computer to function as the sentence unit search device; and a computer-readable recording medium on which the computer program is recorded.
[0018] Another object of the present invention is to provide a sentence unit search method and a document storage device in which the weight value expressing the salience of each word in the weighted word group associated with a sentence unit or with received words is calculated as the probability that the word will appear, or be referred to, in subsequent sentence units or words, so that the salience of words, which changes over time across the sentence units or words in the flow of context, can be expressed quantitatively and put to use.
[0019] A further object of the present invention is to provide a sentence unit search method and a document storage device that quantitatively calculate the degree of relevance between related words and reflect that relevance in the salience of each word in each sentence unit or in received words, so that sentence units evoking a word the user is conscious of while speaking or writing can be retrieved effectively even when that word does not appear in the uttered words or the written text.
Means for Solving the Problems
[0020] The sentence unit search method according to the first invention uses a document set in which a plurality of pieces of natural-language document data are stored, separates document data obtained from the document set into sentence units each consisting of one or more sentences, receives words, and searches the separated sentence units on the basis of the received words. The method comprises: a step of storing in advance, in association with each of the successive sentence units in the document data, a weighted word group consisting of a plurality of words each assigned a weight value for that sentence unit; a step of, when words are received, associating with them a weighted word group consisting of a plurality of words each assigned a weight value for those words; a similar sentence unit extraction step of extracting from the document set sentence units recorded in association with weighted word groups similar to the weighted word group associated with the received words; and a step of outputting the extracted sentence units.
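As a concrete illustration of the data layout this claim implies, the following is a minimal sketch in Python. The names (SentenceUnit, weights, search) and the dot-product similarity are illustrative assumptions, not part of the claimed method; the claim only requires that some similarity test over weighted word groups be applied.

    # Minimal sketch: sentence units indexed by weighted word groups.
    # All names and the similarity measure are illustrative assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class SentenceUnit:
        text: str
        weights: dict[str, float] = field(default_factory=dict)  # word -> weight value

    def similarity(a: dict[str, float], b: dict[str, float]) -> float:
        # One possible similarity over weighted word groups (dot product here).
        shared = set(a) & set(b)
        return sum(a[w] * b[w] for w in shared)

    def search(query_weights: dict[str, float],
               units: list[SentenceUnit], top_n: int = 5) -> list[SentenceUnit]:
        # Rank stored sentence units by similarity to the query's weighted word group.
        ranked = sorted(units, key=lambda u: similarity(query_weights, u.weights),
                        reverse=True)
        return ranked[:top_n]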
[0021] In the sentence unit search method according to the second invention, the similar sentence unit extraction step comprises: a step of judging whether the distribution of the weight values of the plurality of words in the weighted word group associated with the received words and the distribution of the weight values of the plurality of words in the weighted word group associated with a previously separated sentence unit satisfy a predetermined condition; and a step of extracting the sentence units whose associated weighted word groups are judged to satisfy the predetermined condition.
[0022] In the sentence unit search method according to the third invention, the similar sentence unit extraction step comprises: a step of extracting, from the previously separated sentence units, those associated with a word group containing the same words as the weighted word group associated with the received words; a step of calculating, for each word shared between the received words and an extracted sentence unit, the difference between the weight values assigned in the two associated word groups; and a step of assigning the extracted sentence units priorities in ascending order of the calculated difference, the extracted sentence units being output according to the priorities.
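A sketch of this ranking rule follows, working directly on word-to-weight mappings. Aggregating the per-word differences by summation is one plausible choice; the claim itself only specifies that smaller differences earn higher priority.

    def rank_by_weight_difference(query_weights: dict[str, float],
                                  unit_weights: list[dict[str, float]]) -> list[int]:
        # Indices of sentence units sharing words with the query,
        # ordered so that smaller weight-value differences come first.
        candidates = [i for i, u in enumerate(unit_weights)
                      if set(u) & set(query_weights)]

        def total_difference(i: int) -> float:
            u = unit_weights[i]
            shared = set(u) & set(query_weights)
            # Aggregating per-word differences by summation is an assumption.
            return sum(abs(u[w] - query_weights[w]) for w in shared)

        return sorted(candidates, key=total_difference)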
[0023] The sentence unit search method according to the fourth invention comprises a step of calculating each weighted word group as a multidimensional vector in which each word constitutes one dimension and the magnitude of the weight value assigned to each word is the component in the dimension corresponding to that word. The similar sentence unit extraction step comprises a step of calculating the distance between the multidimensional vector stored for each separated sentence unit and the multidimensional vector associated with the received words, and a step of assigning the sentence units priorities in ascending order of the calculated distance, the sentence units being output according to the assigned priorities.
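The vector form can be sketched with NumPy as below; the fixed vocabulary-to-dimension mapping and the use of Euclidean distance are assumptions consistent with, but not mandated by, the claim.

    import numpy as np

    def to_vector(weights: dict[str, float], vocabulary: list[str]) -> np.ndarray:
        # One dimension per vocabulary word; the weight value is the component.
        return np.array([weights.get(w, 0.0) for w in vocabulary])

    def rank_by_distance(query_vec: np.ndarray,
                         unit_vecs: list[np.ndarray]) -> list[int]:
        # Indices of stored sentence units, nearest (most similar) first.
        dists = [float(np.linalg.norm(query_vec - v)) for v in unit_vecs]
        return sorted(range(len(unit_vecs)), key=dists.__getitem__)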
[0024] In the sentence unit search method according to the fifth invention, when a weighted word group is associated with a sentence unit or with received words, a reference probability calculation step calculates, for each word, the reference probability that the word will appear, or be referred to, in sentence units or words subsequent to that sentence unit or those words, and the calculated reference probability is assigned as the weight value of each word.
[0025] In the sentence unit search method according to the sixth invention, the reference probability calculation step comprises: a step of identifying, for each word, a feature pattern that includes the pattern with which the word appears across a plurality of sentence units including preceding ones, or the pattern with which the word is referred to from preceding sentence units; and a step of calculating the proportion of words for which the same feature pattern is identified, within the document data obtained from the document set, that appear or are referred to in a subsequent sentence unit; the calculated proportion is taken as the reference probability.
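Written out, the proportion described here is a simple empirical estimate. In our own notation (not the patent's): let C(f) be the number of word occurrences in the document set whose identified feature pattern equals f, and C_next(f) the number of those occurrences whose word appears or is referred to again in the following sentence unit. Then

    \hat{P}_{\mathrm{ref}}(f) = \frac{C_{\mathrm{next}}(f)}{C(f)}

is the calculated proportion used as the reference probability for any word exhibiting feature pattern f.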
[0026] The sentence unit search method according to the seventh invention comprises: an identification step of identifying, for each word extracted from the document set, the feature pattern of the word; a judgment step of judging whether a word for which the same feature pattern is identified appears or is referred to in a subsequent sentence unit in the document data; and a regression step of performing regression analysis between the identified feature patterns and the judgment results for the words they identify, to calculate regression coefficients of the feature patterns with respect to the reference probability. When a weighted word group is stored in association with a sentence unit, or associated with received words, the reference probability calculation step identifies, for each sentence unit or for the words, the feature pattern of each word in that sentence unit or those words, and calculates the reference probability using the regression coefficients for the identified feature pattern.
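The claim does not fix the form of the regression; a logistic model, which maps a weighted sum of feature values to a probability, is one natural instantiation and is sketched below (the scikit-learn usage and the toy feature encoding are our assumptions).

    from sklearn.linear_model import LogisticRegression

    # X: one row of numeric feature-pattern values per observed word occurrence.
    # y: 1 if that word appeared / was referred to in the following sentence unit.
    X = [[1, 0, 2, 1], [3, 1, 0, 0], [1, 1, 1, 0]]   # toy training samples
    y = [1, 0, 1]

    model = LogisticRegression().fit(X, y)            # regression step
    coefficients = model.coef_                        # per-feature regression coefficients

    # Reference probability for a new occurrence with feature pattern x:
    x = [[2, 0, 1, 1]]
    reference_probability = model.predict_proba(x)[0, 1]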
[0027] In the sentence unit search method according to the eighth invention, the proportion is calculated, for sentence units, over document data obtained from a first document set consisting of written language, and, for received words, over document data obtained from a second document set consisting of spoken language.
[0028] In the sentence unit search method according to the ninth invention, the identification step, the judgment step, and the regression step are executed in advance for each of a first document set consisting of written language and a second document set consisting of spoken language; the reference probability calculation step calculates the reference probability for a feature pattern identified in a sentence unit using the regression coefficients obtained by the regression step executed on the first document set, and calculates the reference probability for a feature pattern identified in the received words using the regression coefficients obtained by the regression step executed on the second document set.
[0029] In the sentence unit search method according to the tenth invention, the feature pattern is specified by information including one or more of: the number of sentence units or sets of words from the preceding sentence unit or words from which the word is referred to, up to the sentence unit or words containing the word; the dependency information of the word in the most recent preceding sentence unit or words in which it appears or is referred to; the number of times the word has appeared or been referred to up to the sentence unit or words containing it; the noun classification of the word in the most recent preceding sentence unit or words in which it appears or is referred to; whether the word is the topic in the most recent preceding sentence unit or words in which it appears or is referred to; whether the word is the subject in the most recent preceding sentence unit or words in which it appears or is referred to; the grammatical person of the word in the sentence unit or words containing it; and the part-of-speech information of the word in the sentence unit or words containing it.
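Concretely, such a feature pattern can be encoded as a fixed-length record; the field names and integer encodings below are illustrative assumptions chosen to match the items listed in this claim.

    from dataclasses import dataclass

    @dataclass
    class FeaturePattern:
        distance: int          # sentence units since the word last appeared / was referred to
        dependency_role: int   # coded dependency relation in that most recent occurrence
        mention_count: int     # appearances/references up to the current sentence unit
        noun_class: int        # coded noun classification (e.g. proper vs. common)
        is_topic: bool         # topic of its most recent sentence unit?
        is_subject: bool       # grammatical subject there?
        person: int            # grammatical person (1, 2, 3)
        pos: int               # coded part of speech

        def as_row(self) -> list[float]:
            # Numeric row usable as regression input (cf. the sketch above).
            return [self.distance, self.dependency_role, self.mention_count,
                    self.noun_class, float(self.is_topic), float(self.is_subject),
                    self.person, self.pos]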
[0030] In the sentence unit search method according to the eleventh invention, the feature pattern is specified by information including one or more of: the time corresponding to the interval between the preceding sentence unit or words from which the word is referred to and the sentence unit or words containing the word; the speech rate corresponding to the word in the most recent preceding sentence unit or words in which it appears or is referred to; and the frequency of the voice corresponding to the word in that most recent preceding sentence unit or words.
[0031] The sentence unit search method according to the twelfth invention comprises: a first step of extracting, for one word among the words extracted from the document set, from the weighted word groups associated with the separated sentence units, those word groups that contain the one word and in which the weight value of the one word is at least a predetermined value; a second step of creating a related word group in which the value obtained by integrating, word by word, the weight values of the words in the word groups extracted in the first step is assigned as the degree of relevance of the one word to each word; a third step of storing the created related word group in association with the one word; a step of executing the first to third steps in advance for each of the extracted words; and a relevance addition step of reassigning the weight value of each word in the weighted word group associated with each sentence unit or with each set of received words, using the degrees of relevance of the words in the related word group stored in association with that word.
[0032] In the sentence unit search method according to the thirteenth invention, the second step comprises: a step of calculating, over the extracted word groups, the sum of the weight values of each word contained in them, each weighted by the weight value of the one word; a step of averaging the calculated sums; and a step of assigning the averaged sum of the weight values of each word as the degree of relevance of that word in the related word group to be created.
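In our notation (the patent gives no formula), let v_s(w) be the weight value of word w in sentence unit s, let w_0 be the chosen word, and let S(w_0) be the set of sentence units whose word groups give w_0 a weight of at least the threshold. One consistent reading of the weighted sum followed by averaging is

    r(w \mid w_0) = \frac{\sum_{s \in S(w_0)} v_s(w_0)\, v_s(w)}{\sum_{s \in S(w_0)} v_s(w_0)}

so that sentence units in which w_0 itself is highly salient contribute more to the relevance of w; dividing by |S(w_0)| instead would be an equally literal reading of "averaging".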
[0033] In the sentence unit search method according to the fourteenth invention, the relevance addition step comprises: a step of multiplying, for each word in the weighted word group associated with each sentence unit or each set of received words, the weight value of each word in that group by the degree of relevance of each word contained in the related word group stored in association with that word; and a step of reassigning the weight value of each word in the weighted word group on the basis of the multiplication results.
[0034] The sentence unit search method according to the fifteenth invention comprises a step of calculating the related word group of each word as a multidimensional relevance vector in which each word constitutes one dimension and the magnitude of the degree of relevance assigned to each word is the component in the dimension corresponding to that word; the relevance addition step transforms the multidimensional vector stored for each separated sentence unit by the matrix whose columns are the relevance vectors of the words.
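A minimal NumPy sketch of this transformation, under the assumption of a three-word vocabulary and made-up relevance values:

    import numpy as np

    # Columns are the relevance vectors of the vocabulary words w1, w2, w3.
    # Diagonal entries are each word's relevance to itself (here 1.0).
    R = np.array([[1.0, 0.6, 0.0],
                  [0.6, 1.0, 0.1],
                  [0.0, 0.1, 1.0]])

    v = np.array([0.0, 0.8, 0.2])   # weighted word group of one sentence unit
    v_transformed = R @ v           # w1 now inherits weight from the related w2
    print(v_transformed)            # -> [0.48 0.82 0.28]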
[0035] The sentence unit search method according to the sixteenth invention uses a document set in which a plurality of pieces of natural-language document data are stored, receives words, and searches the document set on the basis of the received words. The method comprises: a step of separating document data obtained from the document set into sentence units each consisting of one or more sentences; a step of extracting, for each separated sentence unit, the words appearing in the sentence unit or referred to from preceding sentence units in the document data; a step of identifying and storing, for each word extracted for a sentence unit, its features in each sentence unit; a step of identifying, for each separated sentence unit, a feature pattern including the pattern of combinations of those features when the word extracted for the sentence unit appears in the sentence unit and in preceding sentence units, or the pattern of reference when the word is referred to from preceding sentence units; a step of storing the identified feature pattern together with whether the word it identifies appeared or was referred to in a subsequent sentence unit; a step of executing regression learning that performs, over all sentence units of the documents obtained from the document set, regression analysis of the reference probability that a word identified by a given feature pattern appears or is referred to in the subsequent sentence unit, to obtain regression coefficients for the feature patterns; a step of calculating, for each separated sentence unit and for each word extracted from the preceding sentence units up to that sentence unit in the document data, the reference probability of the word using the regression coefficients for the feature pattern identified in that sentence unit; a step of storing in advance, in association with each sentence unit, a weighted word group to which the calculated reference probabilities are assigned; a step of, when words are received, storing them in the order received; a step of, when words are received, extracting the words appearing in them or referred to from previously received words; a step of identifying the features of each extracted word in the received words; a step of identifying a feature pattern including the pattern of combinations of features when the word appeared in previously received words, or the pattern of reference from previously received words; a step of calculating the reference probability of the word using the regression coefficients for the identified feature pattern; a step of associating with the received words a weighted word group to which the calculated reference probabilities are assigned; a step of calculating, for each word shared between the weighted word group associated with the received words and that of a previously separated sentence unit, the difference between the assigned reference probabilities; a step of assigning the previously separated sentence units priorities in ascending order of the difference in reference probability; and a step of outputting the sentence units according to the assigned priorities.
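Putting the pieces together, the flow of the sixteenth invention can be summarized in pseudocode-style Python; every helper called here is a placeholder standing for a step sketched earlier (sentence separation, feature extraction, regression, ranking), not an implementation the patent supplies.

    def build_index(document_set):
        # Offline: learn reference probabilities and index sentence units.
        units = [u for doc in document_set for u in split_into_sentence_units(doc)]
        model = train_reference_probability_model(units)   # regression learning
        for u in units:
            u.weights = {w: model.reference_probability(w, u)
                         for w in words_in_or_referenced_by(u)}
        return units, model

    def search(received_words, history, units, model):
        # Online: weight the received words in their dialogue context, then rank
        # stored units by per-word reference-probability differences.
        query_weights = {w: model.reference_probability(w, received_words, history)
                         for w in words_in_or_referenced_by(received_words, history)}
        return rank_by_weight_difference(query_weights, [u.weights for u in units])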
[0036] The sentence unit search device according to the seventeenth invention comprises means for obtaining document data from a document set in which a plurality of pieces of natural-language document data are stored and means for receiving words, and searches the document set on the basis of the received words. The device comprises: means for separating the obtained document data into sentence units each consisting of one or more sentences; means for storing, in association with each of the successive sentence units in the obtained document data, a weighted word group consisting of the plurality of words each assigned a weight value for that sentence unit; means for storing received words in the order received; means for associating, each time new words are received, a weighted word group consisting of the plurality of words each assigned a weight value for those words; means for extracting, from the previously separated sentence units, sentence units recorded in association with weighted word groups similar to the weighted word group associated with the received words; and means for outputting the extracted sentence units.
[0037] The computer program according to the eighteenth invention causes a computer capable of obtaining document data from a document set in which a plurality of pieces of natural-language document data are stored to function as means for receiving words and means for searching the document set on the basis of the received words. The program causes the computer to function as: means for separating the obtained document data into sentence units each consisting of one or more sentences; means for storing, in association with each of the successive sentence units in the obtained document data, a weighted word group consisting of the plurality of words each assigned a weight value for that sentence unit; means for storing received words in the order received; means for associating, each time new words are received, a weighted word group consisting of the plurality of words each assigned a weight value for those words; and means for extracting, from the previously separated sentence units, sentence units recorded in association with weighted word groups similar to the weighted word group associated with the received words.
[0038] The computer-readable recording medium according to the nineteenth invention is characterized in that the computer program of the eighteenth invention is recorded on it.
[0039] The document storage device according to the twentieth invention comprises means for storing a plurality of pieces of natural-language document data and means for separating the stored document data, from the beginning of the document data onward, into sentence units each consisting of one or more sentences; for each separated sentence unit, the words appearing in the sentence unit or referred to from preceding sentence units are extracted, and the extracted words are stored for each sentence unit. The device further comprises means for storing, in association with each of the successive sentence units in the document data, a weighted word group consisting of the plurality of words each assigned a weight value for that sentence unit.
[0040] The document storage device according to the twenty-first invention comprises: extraction means for extracting, for one word among the extracted words, from the weighted word groups associated with the sentence units, those word groups that contain the one word and in which the weight value of the one word is at least a predetermined value; creation means for creating a related word group in which the value obtained by integrating, word by word, the weight values of the words in the word groups extracted by the extraction means is assigned as the degree of relevance of the one word to each word; and storage means for storing the created related word group in association with the one word. The processing of the extraction means, the creation means, and the storage means is executed for each of the extracted words, and a related word group is stored in association with each word.
[0041] In the first, seventeenth, eighteenth, and nineteenth inventions, document data is obtained from a document set in which natural-language document data is recorded, and the obtained document data is further separated into sentence units of one or more sentences. For each sentence unit, each word appearing in the document set is assigned a weight value for that sentence unit, and the weighted word group of those words is stored in association with the sentence unit. When words are received, a weighted word group of words carrying weight values for those words is likewise associated with them. From the previously separated sentence units, those associated with weighted word groups similar to the weighted word group associated with the received words are extracted and output.
[0042] In the second invention, when sentence units associated with similar weighted word groups are extracted in the first invention, whether a weighted word group is similar is judged by whether the distribution of the weight values of the plurality of words in the weighted word group stored in advance for a sentence unit and the distribution of the weight values of the plurality of words in the weighted word group associated with the received words satisfy a predetermined condition, and the sentence units associated with weighted word groups judged similar are extracted.
[0043] In the third invention, when sentence units associated with similar weighted word groups are extracted in the first or second invention, sentence units whose weighted word groups contain the same words are extracted, and priorities are assigned in ascending order of the difference between the weight values assigned to those same words.
[0044] In the fourth invention, the weighted word group of the first invention is obtained as a multidimensional vector in which each word constitutes one dimension and the magnitude of the weight value assigned to each word is the component in the dimension corresponding to that word. Whether weighted word groups are similar is judged by whether the distance between them, that is, between the multidimensional vectors, is short. The extracted sentence units are output in ascending order of the distance between the multidimensional vectors, in other words, in order of the similarity of the weighted word groups.
[0045] In the fifth invention, the weight value assigned to each word in the first to fourth inventions is the calculated reference probability that the word will appear, or be referred to, in subsequent sentence units or words.
[0046] In the sixth invention, the reference probability calculated in the fifth invention is calculated as the proportion of words, identified as having the same feature pattern as the one identified for a given word (the pattern of its appearances from preceding sentence units up to the current one, or the pattern of references from preceding sentence units), that go on to appear or be referred to in the subsequent sentence unit within the document set.
[0047] In the seventh invention, regression analysis is performed between the feature pattern identified for each word extracted from the document set and the judgment of whether the word identified by that feature pattern appeared or was referred to in a subsequent sentence unit in the documents of the document set, and regression coefficients of the feature patterns with respect to the reference probability that a word appears or is referred to in the subsequent sentence unit are calculated. The reference probability calculated in the fifth invention is then obtained, for each word, by identifying its feature pattern and applying the regression coefficients to that feature pattern.
[0048] In the eighth and ninth inventions, the document set is used divided into a first document set consisting of written language and a second document set consisting of spoken language. The reference probabilities assigned to the words of the weighted word groups associated with sentence units are calculated on the basis of the first document set, and the reference probabilities assigned to the words of the weighted word group associated with received words are calculated on the basis of the second document set.
[0049] In the tenth invention, when the reference probability is calculated in the sixth to ninth inventions, the features used to identify each word's feature pattern are treated quantitatively, including: the number of sentence units or sets of words from a preceding appearance or reference up to the current sentence unit or words; the dependency information of the word where it appeared or was referred to; the number of times it has appeared or been referred to; the noun classification of the word; whether the word is the topic; whether the word is the subject; the grammatical person of the word; and its part-of-speech information.
[0050] In the eleventh invention, when the reference probability is calculated in the sixth to tenth inventions, the features used to identify each word's feature pattern are treated quantitatively, including: the time elapsed since the preceding sentence unit or words in which the word appeared or was referred to; the speech rate of the voice corresponding to the word where it appeared or was referred to; and the pitch (frequency) of that voice.
[0051] In the twelfth invention, in the first to eleventh inventions, for one word among the words extracted from the document set, the weighted word groups in which the weight value of that word is at least a predetermined value are extracted. A single weighted word group, obtained by integrating, word by word, the weight values of the plurality of weighted word groups extracted for the one word, is created as its related word group. The degree of relevance of each word in the created related word group expresses how deeply that word's weight value is related to the one word whenever the one word carries a weight value at or above the predetermined value. A related word group is generated and stored for each word extracted from the document set. The weight value of each word in the weighted word group associated with each sentence unit or with received words is then reassigned using the degrees of relevance of the words in the related word group associated with that word.
[0052] In the thirteenth invention, when the related word group for one word is created in the twelfth invention, a sum is calculated over the word groups extracted as weighted word groups in which the one word's weight value is at least the predetermined value, each weighted by the weight value of the one word in that weighted word group. The sums are averaged, and the averaged sum of the weight values of each word is assigned as the degree of relevance of that word in the related word group.
[0053] In the fourteenth invention, the degree of relevance of each word in the related word groups stored in the twelfth or thirteenth invention is multiplied by the weight value of each word in the weighted word group associated with each sentence unit or each set of received words, and the multiplication results are reassigned as the weight values of the words in the weighted word group. When attention is paid to one word in the weighted word group, the degrees of relevance of the words in the related word group associated with that one word are used: multiplying the weight values of the other words in the weighted word group by their degrees of relevance in the one word's related word group incorporates into the one word's weight value the influence of the weight values of highly relevant other words.
[0054] In the fifteenth invention, the related word groups of the twelfth to fourteenth inventions are obtained as multidimensional relevance vectors in which each word constitutes one dimension and the magnitude of the degree of relevance assigned to each word is the component in the corresponding dimension. The multidimensional vector associated with each sentence unit or set of words is transformed by the matrix formed from the columns of the relevance vectors of the words. That is, the multidimensional vector is expressed in an oblique coordinate system in which the distance between word dimensions is shorter the more strongly the words are related. A multidimensional vector expressing a weighted word group is thereby rotated toward the axes of words strongly related to the words it contains, and the distance between multidimensional vectors containing strongly related words becomes shorter.
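In our notation, write r_i for the relevance vector of the i-th vocabulary word and A = [r_1 ... r_n] for the matrix whose columns they form. The transformation and its effect on distances are then

    v' = A v, \qquad d(v'_1, v'_2) = \lVert A (v_1 - v_2) \rVert

so two weighted word groups that differ mainly in strongly related dimensions (near-parallel columns of A) end up closer together after the transformation, which is exactly the oblique-coordinate effect described above.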
[0055] In the sixteenth invention, for each sentence unit into which the document data obtained from the document set is further separated, the words in the sentence unit or referred to from preceding sentence units are extracted; the features of each word in each sentence unit are identified; and a feature pattern is identified that includes the pattern of combinations of features from preceding sentence units up to each sentence unit, or the pattern of references to the word from preceding sentence units. On the basis of regression learning of the reference probability over the identified feature patterns, the reference probability of each extracted word is calculated and stored in advance for each sentence unit as a weighted word group. For received words as well, a feature pattern based on the preceding words is identified, the reference probability of each word is calculated, and a weighted word group is associated with the words. The previously stored sentence units are assigned priorities in ascending order of the difference between the reference probabilities of the words they share with the weighted word group of the received words, and are output accordingly.
[0056] In the twentieth invention, for each sentence unit into which the document data obtained from the document set is further separated, a weighted word group carrying the words' weights in that sentence unit is stored in association with the sentence unit.
[0057] In the twenty-first invention, the related word groups created in the twelfth invention for the words extracted from the documents are stored.
Effects of the Invention
[0058] According to the present invention, a weighted word group, in which each of a plurality of words is assigned a weight value for a given sentence unit, is stored in association with each sentence unit of one or more sentences in the document data obtained from the document set. The weighted word group is the set of the weight values of the words in each sentence unit, and can be taken as an estimate of the unit of meaning of that sentence unit. Because each weight value reflects the context carried over from the preceding sentence units, the weighted word group of each sentence unit in the separated sequence can be grasped, unlike a unit of meaning for the document as a whole, as a unit of meaning that changes dynamically over time within the flow of context continuing from the preceding sentences in the document. By extracting the sentence units associated with weighted word groups similar to the weighted word group carrying the weight values of the words input for a search, sentence units whose word salience, that is, whose unit of meaning, is similar can be retrieved directly, rather than matching the document as a whole.
[0059] Whether weighted word groups are similar can be judged as follows: when the distribution of the weight values of the plurality of words in the weighted word group of the received words is compared with the distribution of the weight values in a stored weighted word group, and a predetermined condition under which the distributions can be judged mutually similar is satisfied, the stored weighted word group can be said to be similar to that of the received words. For example, if the predetermined condition is that the distributions of the words' weight values be proportionally alike, the weighted word groups can be called similar when the ratio of one word's weight value to another's in one group is preserved between the corresponding weight values in the other group. The condition may also be set, for instance, as whether the weight values of one or more selected words are all at or above a predetermined value. Similarity may also be judged by whether the differences between the weight values of the same words are small when the weighted word group associated with the received words is compared with that associated with a previously separated sentence unit.
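Two of the example conditions in this paragraph could be realized as below; the thresholds and the use of cosine similarity as a proportional-likeness test are our assumptions, not the patent's.

    import math

    def proportionally_alike(a: dict[str, float], b: dict[str, float],
                             threshold: float = 0.9) -> bool:
        # Cosine similarity is insensitive to overall scale, so a high value
        # means the weight-value distributions are proportionally alike.
        shared = set(a) & set(b)
        dot = sum(a[w] * b[w] for w in shared)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return na > 0 and nb > 0 and dot / (na * nb) >= threshold

    def focus_words_salient(group: dict[str, float], focus: set[str],
                            minimum: float = 0.5) -> bool:
        # Second condition type: every selected word carries at least a
        # minimum weight value in the group.
        return all(group.get(w, 0.0) >= minimum for w in focus)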
[0060] Further, by expressing a weighted word group as a multidimensional vector in which each word is one dimension and the weight value of each word in the sentence unit or words is the component for that dimension, the unit of meaning of each sentence unit or set of words can be handled as a quantitative vector. Handling units of meaning as quantitative multidimensional vectors means that, using a computer capable of vector operations, similar sentence units can be extracted directly by calculating the distance between the vector associated with the received words and the vector stored for each sentence unit. Moreover, expressing them as multidimensional vectors makes it possible to frame the condition satisfied by the received words, or by previously separated sentence units, as a question of which region of the multidimensional space the vector falls in, so that similar sentence units can again be extracted directly.
[0061] The document set here is not limited to a collection of so-called written-language document data, and the sentence units separated from it are therefore not necessarily written-language sentence units. Document data means data already stored, as distinguished from words received in real time, and may be document data in which a spoken dialogue has been written down in order.
[0062] The received words are not limited to words or sentences input for the purpose of a search; they may be, for example, the individual utterances, including speech, in a dialogue between users. Since sentence units are extracted on the basis of a weighted word group carrying weight values for each utterance, a unit of meaning can be estimated for every utterance while allowing for the fact that meaning changes dynamically, utterance by utterance, over the course of a dialogue. Sentence units similar to the unit of meaning estimated for each utterance can therefore be extracted and presented.
[0063] Further, according to the present invention, by assigning as the weight value of each word in the weighted word group the reference probability that the word will appear or be referred to in subsequent sentence units or words, the weight value can express the degree of attention the word receives, that is, its salience, as a quantitative value. A word that is contextually important and attended to in a given sentence unit can be expected to continue to appear or be referred to with high probability; the reference probability can therefore be said to indicate the degree of attention, the salience, of each word in that sentence unit.
[0064] A word that is represented in a sentence unit only by a demonstrative pronoun or a zero pronoun without actually appearing, or even a word represented by neither, is considered highly salient in that sentence unit or those words if it appears or is referred to in subsequent sentence units or words, even though it does not actually appear in the current one. Since the reference probability is calculated from the word's feature pattern over the preceding sentence units relative to each sentence unit, the level of salience can be expressed quantitatively and more accurately even for words that do not actually appear.
[0065] Furthermore, when words are received as speech, whether a word contained in them carries weight can be characterized quantitatively from the properties of the voice at the moment of utterance, namely the speaking rate and the tone, and the level of each word's salience can be expressed accordingly.
[0066] Further, according to the present invention, when the sentence units output as search results are written language, the reference probabilities are calculated on the basis of a document set of written language; when the received words are spoken language, the reference probabilities are learned and calculated on the basis of a document set of spoken language. Sentence units closer in meaning can thus be output while respecting the differing characteristics of written and spoken language.
[0067] Also, according to the present invention, the degree of relevance from each word is calculated quantitatively and stored word by word. The weight value of each word in a weighted word group is recalculated on the basis of the weight values of the other words and their degrees of relevance to the word in question. The weight value of one word can thereby reflect the influence of the weight values of other words highly relevant to it; that is, when a word highly relevant to the one word carries a high weight value, the resulting rise in the one word's weight value can be reproduced.
[0068] When the related word group of a word is expressed as a relevance vector, and a weighted word group as a multidimensional vector, transforming the multidimensional vector by the matrix formed from the columns of the words' relevance vectors shortens the distance between multidimensional vectors expressing weighted word groups that contain strongly related words.
[0069] In this way, the influence of the weight values of words highly relevant to one word, among the other words in the weighted word group, can be reflected in that word's weight value. By reflecting degrees of relevance in the salience of each word in each sentence unit or set of words, the invention achieves the excellent effect that sentence units evoking a word the user is conscious of can be retrieved effectively even when the word does not appear in the received words.
Brief Description of the Drawings
[0070] [Fig. 1] is an explanatory diagram showing an outline of the sentence unit search method according to the present invention.
[Fig. 2] is a block diagram showing the configuration of a search system using the sentence unit search device according to Embodiment 1.
[Fig. 3] is a flowchart showing the procedure by which the CPU of the sentence unit search device in Embodiment 1 performs tagging and word extraction based on the results of the morphological analysis and syntactic analysis of acquired document data, and stores the results.
[Fig. 4] is an explanatory diagram showing an example of the contents of document data stored in the document storage means in Embodiment 1.
[Fig. 5] is an explanatory diagram showing an example of document data that the CPU of the sentence unit search device in Embodiment 1 stores in the document storage means with the results of morphological analysis and syntactic analysis attached.
[Fig. 6] is an explanatory diagram showing an example of the list of words extracted from all document data acquired by the CPU of the sentence unit search device in Embodiment 1.
[Fig. 7] is a flowchart showing the procedure by which the CPU of the sentence unit search device in Embodiment 1 extracts samples from the tagged document data stored in the document storage means and performs regression analysis to estimate a regression equation for calculating reference probabilities.
[Fig. 8] is an explanatory diagram showing an example of feature patterns identified for sentences in the document data stored in the document storage means in Embodiment 1.
[Fig. 9] is a flowchart showing the procedure by which the CPU of the sentence unit search device in Embodiment 1 calculates and stores a reference probability of each word for every sentence of the tagged document data stored in the document storage means.
[Fig. 10] is a flowchart showing the procedure by which the CPU of the sentence unit search device in Embodiment 1 calculates and stores a reference probability of each word for every sentence of the tagged document data stored in the document storage means.
[Fig. 11] is an explanatory diagram showing an example in which the CPU of the sentence unit search device in Embodiment 1 has separated a document represented by document data into individual sentences.
[Fig. 12] is an explanatory diagram showing an example of document data that the CPU of the sentence unit search device in Embodiment 1 stores in the document storage means with the calculated reference probabilities attached.
[Fig. 13] is an explanatory diagram showing an example of the contents of the database when the weighted word groups calculated for each sentence unit by the CPU of the sentence unit search device in Embodiment 1 are indexed and stored.
[Fig. 14] is an explanatory diagram showing how the pairs of words stored for each sentence by the CPU of the sentence unit search device, together with the reference probabilities calculated for those words, change as the sentences continue.
[Fig. 15] is a flowchart showing the search processing procedure of the sentence unit search device and the reception device in Embodiment 1.
[Fig. 16] is a flowchart showing the search processing procedure of the sentence unit search device and the reception device in Embodiment 1.
[Fig. 17] is a flowchart showing the search processing procedure of the sentence unit search device and the reception device in Embodiment 1.
[Fig. 18] is an explanatory diagram showing an example of feature patterns that the CPU of the sentence unit search device in Embodiment 1 has identified for text data received from the reception device.
[Fig. 19] is a flowchart showing the search processing procedure of the sentence unit search device and the reception device in Embodiment 2.
[Fig. 20] is an explanatory diagram showing an outline of the influence of the salience of words closely related to a given word, in connection with the search method of the present invention in Embodiment 3.
[Fig. 21] is a flowchart showing the procedure by which the CPU of the sentence unit search device in Embodiment 3 creates related word groups.
[Fig. 22] is a flowchart showing the procedure by which the CPU of the sentence unit search device in Embodiment 3 creates related word groups.
[Fig. 23] is an explanatory diagram showing examples of weighted word groups at each stage of processing when related word groups are created by the CPU of the sentence unit search device in Embodiment 3.
[Fig. 24] is a flowchart showing the procedure by which the CPU of the sentence unit search device in Embodiment 3 recalculates the weight value of each word in the weighted word groups stored in association with each sentence unit.
[Fig. 25] is a flowchart showing the details of the procedure by which the CPU of the sentence unit search device in Embodiment 3 recalculates the weight value of each word in the weighted word groups stored in association with each sentence unit.
[Fig. 26] is an explanatory diagram showing an example of the contents of the weight values representing the salience of each word, as calculated by the CPU of the sentence unit search device in Embodiment 3.
[Fig. 27] is a flowchart showing the search processing procedure of the sentence unit search device and the reception device in Embodiment 3.
[Fig. 28] is a flowchart showing the search processing procedure of the sentence unit search device and the reception device in Embodiment 3.
[Fig. 29] is a block diagram showing a configuration for carrying out the sentence unit search method of the present invention with a sentence unit search device.
Explanation of Symbols
[0071] 1 Sentence unit search device
11 CPU
13 Storage means
15 Communication means
16 Document set connection means
17 Auxiliary storage means
18 Portable recording medium
1P Control program
2 Document storage means
4 Reception device
BEST MODE FOR CARRYING OUT THE INVENTION
[0072] The present invention will now be described in detail with reference to the drawings illustrating its embodiments.
[0073] Fig. 1 is an explanatory diagram showing an outline of the sentence unit search method according to the present invention. In Fig. 1, reference numeral 100 denotes a document set in which a plurality of pieces of document data are stored. One document 101 obtained from the document set 100 consists of sentence units S1, ..., Si, Si+1, ..., each of which comprises one or more sentences. The sentence units S1, ..., Si, Si+1, ... run in order from the beginning of the document 101, carrying meanings that shift over time along the flow of the context. Reference numeral 200 in Fig. 1 denotes a conversation between user A and user B. The conversation 200 between user A and user B is a set of utterances Uj-3, ..., Uj from user A and user B, arranged in time series from top to bottom. The utterances are made in the order Uj-3, Uj-2, Uj-1, Uj. The conversation may also be treated simply as a set of consecutive utterances, without distinguishing user A from user B.
[0074] The sentence unit search method according to the present invention expresses the degree of attention the user pays to each word at the moment of writing or uttering a sentence unit or an utterance as a quantitative weight value assigned to that word. By using the weighted word group, which reflects the degree of attention to each word as it shifts from one sentence unit or utterance to the next in time series, as an index of the contextual meaning of each sentence unit, the method aims to directly retrieve and output sentence units that have a similar contextual meaning.
[0075] The conversation 200 in the example shown in Fig. 1 is a conversation between user A and user B about a trip to Kyoto. In utterance Uj-3 of the conversation 200, 'Kyoto' and 'trip' appear, and the flow of the context is 'a trip to Kyoto'. Utterance Uj-2 contains neither 'Kyoto' nor 'trip', but it is an utterance about 'the time (of the trip to Kyoto)', so attention is directed to 'Kyoto', 'trip', and 'time'. In Uj-1, 'hot' appears. 'Kyoto' and 'trip' do not appear in Uj-1, but since it means '(Kyoto is) hot', 'Kyoto' still carries weight in the contextual meaning. Moreover, between user A and user B, 'Kyoto' and 'time' attract more attention than 'trip' at the time of utterance Uj-1, and user A and user B should both be able to recognize that the contextual meaning has shifted. Furthermore, 'famous' and 'festival' appear in utterance Uj. If only the moment of utterance Uj is considered, the words 'Kyoto', 'trip', 'time', and 'hot' do not appear. For user A at least, however, utterance Uj contextually concerns a 'festival' in 'Kyoto' in 'summer'. Accordingly, even at the time of utterance Uj, 'Kyoto' still carries weight in the contextual meaning. Note that user A, who produced utterance Uj, should at least call to mind 'Gion Festival' or the like as a word corresponding to the festival.
[0076] Meanwhile, the document 101 in the document set 100 contains a travel account of Kyoto. The sentence unit Sk in it carries the meaning that speaking of 'Kyoto' in 'July' means the 'Gion Festival'; that is, Sk means that speaking of the 'festival' of 'Kyoto' in 'July' in 'summer' means the 'Gion Festival'. The utterance Uj and the sentence unit Sk therefore both place weight on 'summer', 'Kyoto', and 'festival', and their contextual meanings are similar. In this way, the sentence unit search method according to the present invention estimates the coherent contextual meaning carried over from the preceding utterances that the user has in mind at the time of utterance Uj, and aims to directly retrieve and output the sentence unit Sk, which has a similar contextual meaning.
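To make 'similar contextual meaning' concrete, the following is a minimal sketch, in Python, of comparing two weighted word groups by cosine similarity. The particular words and weight values are hypothetical illustrations, and the patent does not prescribe cosine similarity as the only possible measure.

```python
import math

def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity of two weighted word groups given as word -> weight."""
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical weighted word group for utterance Uj ...
utterance_uj = {"summer": 0.6, "Kyoto": 0.8, "festival": 0.7, "famous": 0.3}
# ... and for sentence unit Sk in the travel account.
sentence_sk = {"July": 0.5, "Kyoto": 0.7, "festival": 0.6, "Gion Festival": 0.9}

# High overlap on "Kyoto" and "festival" yields a high similarity score.
print(cosine_similarity(utterance_uj, sentence_sk))
```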
[0077] When a computer system implementing the sentence unit search method according to the present invention is realized, it not only accepts a series of utterances and extracts from the document set the sentence units whose contextual meaning is similar to that of those utterances; during a conversation between user A and user B, the computer system can also present information relevant to each utterance and join the conversation as a third participant. The computer system can likewise support the conversation between user A and user B. In the example of Fig. 1, if, following utterance Uj by user A in the conversation 200, the computer system outputs speech such as 'Speaking of Kyoto in July, it is the Gion Festival.', a three-way conversation among user A, user B, and the computer system is realized. Further, when the conversation between user A and user B stalls, the computer system supports it by presenting information such as 'Speaking of Kyoto in July, the Gion Festival'.
[0078] Therefore, in order to realize such retrieval of sentence units with similar contextual meaning from a document set, the sentence unit search method according to the present invention is executed by a computer device. In this case, the computer device requires pre-processing, including a process of separating the document data of the document set into sentence units in advance and a process of storing, for each separated sentence unit, quantitative information representing its contextual meaning. Furthermore, when the computer device accepts an utterance, search processing is required, including a process of obtaining quantitative information representing the meaning of that utterance within the flow of the conversation, and a process of extracting, on the basis of the obtained information, sentence units with similar meaning and outputting them as search results.
[0079] Accordingly, Embodiments 1 to 3 described below first explain the hardware configuration required for a computer device to carry out the sentence unit search method according to the present invention. The processing performed by the computer device is then explained in stages, distinguishing the pre-processing from the search processing. Specifically, each embodiment is described in the following order:
'1. Hardware configuration and system overview';
as pre-processing,
'2. Document data acquisition and natural language analysis' and
'3. Quantifying the coherent meaning of each sentence of document data';
and then
'4. Search processing'.
[0080] In Embodiments 1 to 3 described below, as an example of carrying out the sentence unit search method according to the present invention, a search system is described that comprises hardware storing a document set of document data, computer devices that accept utterances, and a computer device that executes the search processing while connected to the hardware storing the document set and to the computer devices accepting utterances.
[0081] The examples below mainly show the processing and concrete cases for a document set consisting of natural Japanese sentences. The sentence unit search method of the present invention can, of course, be applied not only to Japanese but also to other languages. In that case, for the grammatical handling peculiar to each language in linguistic analysis (morphological analysis and syntactic analysis) and the like, the method best suited to that language is used.
[0082] (Embodiment 1)
1. Hardware configuration and system overview
Fig. 2 is a block diagram showing the configuration of a search system using the sentence unit search device 1 according to Embodiment 1. The search system comprises the sentence unit search device 1, which executes search processing on document data; document storage means 2, which stores document data in natural language; a packet-switched network 3 such as the Internet; and reception devices 4, 4, ..., which accept words such as keywords or speech input by users. The sentence unit search device 1 is a PC (Personal Computer) and is connected to the document storage means 2, which stores the natural-language document data. The reception devices 4, 4, ... are also PCs, and the sentence unit search device 1 is connected to, and can communicate with, the reception devices 4, 4, ... via the packet-switched network 3.
[0083] In the search system of Embodiment 1, the sentence unit search device 1 stores document data containing the sentence units to be searched in the document storage means 2 in advance. The sentence unit search device 1 separates the document data stored in the document storage means 2 into sentence units beforehand and stores, for each sentence unit, quantitative information representing its contextual meaning so that search processing is possible later. The reception devices 4, 4, ... convert accepted words into text data or speech data that a computer can process and transmit the data to the sentence unit search device 1 via the packet-switched network 3. The sentence unit search device 1 extracts, on the basis of the received word data, one or more sentence units, each consisting of one or more sentences, from the document data stored in the document storage means 2, and outputs the extracted sentence units to the reception devices 4, 4, ... via the packet-switched network 3, thereby realizing sentence unit search.
[0084] The sentence unit search device 1 comprises at least a CPU 11 that controls the various hardware components, an internal bus 12 that connects them, storage means 13 comprising nonvolatile memory, a temporary storage area 14 comprising volatile memory, communication means 15 for connecting to the packet-switched network 3, document set connection means 16 for connecting to the document storage means 2, and auxiliary storage means 17 that uses a portable recording medium 18 such as a DVD or CD-ROM.
[0085] The storage means 13 stores a control program 1P, obtained from the portable recording medium 18 such as a DVD or CD-ROM, that causes the PC to operate as the sentence unit search device 1 according to the present invention. The CPU 11 reads the control program 1P from the storage means 13 and executes it, and controls the various hardware components via the internal bus 12. The temporary storage area 14 stores information generated temporarily by the arithmetic processing of the CPU 11.
[0086] The CPU 11 detects, via the communication means 15, that word data transmitted from the reception devices 4, 4, ... has been received, executes processing based on the received word data, and performs the search processing. The CPU 11 can also obtain the document data stored in the document storage means 2 via the document set connection means 16, and can store document data in the document storage means 2 via the document set connection means 16.
[0087] The control program 1P stored in the storage means 13, obtained from the portable recording medium 18 such as a DVD or CD-ROM via the auxiliary storage means 17, further enables the CPU 11 to perform natural language analysis, such as morphological analysis and syntactic analysis, on document data expressed as character strings, on the basis of dictionary information stored in the storage means 13.
[0088] Each of the reception devices 4, 4, ... comprises at least a CPU 41 that controls the various hardware components, an internal bus 42 that connects them, storage means 43 comprising nonvolatile memory, a temporary storage area 44 comprising volatile memory, operation means 45 such as a mouse or keyboard, display means 46 such as a monitor, voice input/output means 47 such as a microphone and a speaker, and communication means 48 for connecting to the packet-switched network 3.
[0089] The storage means 43 stores a processing program and the like that cause the PC to operate as a reception device 4, 4, .... The CPU 41 reads the processing program from the storage means 43 and executes it, and controls the various hardware components via the internal bus 42. The temporary storage area 44 stores information generated temporarily by the arithmetic processing of the CPU 41.
[0090] The CPU 41 can detect a character string input operation from a user via the operation means 45 and store the input character string in the temporary storage area 44. The CPU 41 detects speech input by a user via the voice input/output means 47 and can convert the input speech into text data by reading and executing a speech recognition program stored in the storage means 43. The CPU 41 can also capture speech input by a user, via the voice input/output means 47, as speech data that a computer can process.
[0091] The CPU 41 transmits the text or speech word data, obtained by detecting the user's character string input operation or speech input, to the sentence unit search device 1 via the communication means 48.
[0092] The CPU 41 may convert the speech data into text data before transmission. In that case, the CPU 41 may also transmit features of the speech data obtained through speech recognition, for example the speed at which the phonemes corresponding to each word were uttered and the frequency of those phonemes. The CPU 41 may further store the time intervals between the pieces of speech data corresponding to the individual words, and may also transmit to the sentence unit search device 1 the time elapsed since the point at which the word was last contained in previously accepted utterances.
[0093] 2. Document data acquisition and natural language analysis
In the search system configured as described above, the sentence unit search device 1 first performs pre-processing: it prepares a document set so that the coherent meaning of each sentence unit contained in each piece of document data can be represented later. This section, '2. Document data acquisition and natural language analysis', describes the processing by which the sentence unit search device 1 stores document data in the document storage means 2, linguistically analyzes each piece of document data to separate it into sentence units each consisting of one or more sentences, further analyzes the grammatical features of each sentence unit, and stores the results in the document storage means 2 for each sentence unit. In Embodiment 1, the sentence unit search device 1 treats one sentence as one sentence unit.
[0094] The CPU 11 of the sentence unit search device 1 stores document data containing the sentence units to be searched in the document storage means 2 in advance. The CPU 11 acquires document data available via the communication means 15 and the packet-switched network 3 by Web crawling, and stores it in the document storage means 2 via the document set connection means 16. The CPU 11 then separates the document data thus acquired and stored in the document storage means 2 into sentence units, performs linguistic analysis (morphological analysis and syntactic analysis) on each, and stores the results in association with each sentence unit.
[0095] The procedure by which the CPU 11 of the sentence unit search device 1 acquires document data, performs the natural language analyses of morphological analysis and syntactic analysis on it, and stores the results for each sentence unit is described below. Fig. 3 is a flowchart showing the procedure by which the CPU 11 of the sentence unit search device 1 in Embodiment 1 performs tagging and word extraction based on the results of the morphological analysis and syntactic analysis of the acquired document data, and stores the results. The processing shown in the flowchart of Fig. 3 corresponds to a process of extracting, for each sentence unit, the words that appear in that sentence unit or that are referred to from preceding sentence units, and a process of identifying and storing the features of each word in each sentence unit.
[0096] When the CPU 11 starts Web crawling, it determines whether document data has been acquired (step S11). If the CPU 11 determines that no document data has been acquired (S11: NO), the CPU 11 returns the processing to step S11 and waits until document data is acquired. If the CPU 11 determines that document data has been acquired (S11: YES), the CPU 11 attempts to read the acquired document data one sentence at a time and determines whether the read has succeeded (step S12).
[0097] If the CPU 11 determines that the reading position has not reached the end of the document data and the sentence has been read successfully (S12: YES), it performs morphological analysis and syntactic analysis on the read sentence (step S13).
[0098] From the results of the morphological analysis and syntactic analysis, the CPU 11 extracts the words that appear in the analyzed sentence and the words referred to in that sentence from preceding sentences, and stores them in a list (step S14). Further, as described later, the CPU 11 generates tags from the analysis results (step S15), attaches the tags to the read sentence, and stores it in the document storage means 2 via the document set connection means 16 (step S16).
[0099] If, on the other hand, the CPU 11 determines that the reading position has reached the end of the document data and the sentence read has failed (S12: NO), it ends the processing for the acquired document data.
[0100] The above processing is performed each time document data is acquired, and the tagged document data is accumulated in the document storage means 2.
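As a summary of steps S11 to S16, the following is a minimal sketch of the loop of Fig. 3 in Python. The stub analyzer with its hard-coded dictionary, the simplified su tagging, and the sample document are assumptions for illustration only; they stand in for the ChaSen/CaboCha analysis and the GDA tagging described later.

```python
KNOWN_WORDS = ["祭", "神霊", "九州", "秋"]  # hypothetical dictionary entries

def analyze(sentence: str) -> list[str]:
    """Stand-in for step S13: a real system would run a morphological and
    syntactic analyzer here; this stub only looks up a few known words."""
    return [w for w in KNOWN_WORDS if w in sentence]

def preprocess(documents: list[str], storage: list, word_list: set) -> None:
    """Sketch of the Fig. 3 loop (steps S11-S16)."""
    for document in documents:                                  # S11: document acquired
        sentences = [s + "。" for s in document.split("。") if s]  # S12: read sentences
        for sentence in sentences:
            words = analyze(sentence)                           # S13: linguistic analysis
            word_list.update(words)                             # S14: record words in a list
            tagged = "<su>" + sentence + "</su>"                # S15: generate (simplified) tags
            storage.append(tagged)                              # S16: store tagged sentence

storage: list = []
vocabulary: set = set()
preprocess(["祭とは神霊を祀る行事である。九州では秋に行われる。"], storage, vocabulary)
print(storage)
print(sorted(vocabulary))
```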
[0101] Next, the details of the above processing by the CPU 11 of the sentence unit search device 1 are described with a concrete example.
[0102] Fig. 4 is an explanatory diagram showing an example of the contents of document data stored in the document storage means 2 in Embodiment 1. The document data stored in the document storage means 2 is based on text data, such as HTML (HyperText Markup Language), that the CPU 11 of the sentence unit search device 1 obtains via the communication means 15 from publicly accessible Web servers connected to the packet-switched network 3. The example shown in Fig. 4 is likewise a document of HTML data obtained from a Web page published on the Internet (excerpted from http://ja.wikipedia.org/wiki/祭). This example document is used below to explain document analysis, search, and so on.
[0103] In the sentence reading processing of step S12 shown in the flowchart of Fig. 3, the CPU 11 of the sentence unit search device 1 separates the character strings in the acquired document data into the linguistic unit of a 'sentence' (sentence unit). As a separation method, for example, the CPU 11 may split on the character string representing the full stop '。' for document data in Japanese, or on the character string representing the period '.' for document data in English.
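As an illustration, the following is a minimal sketch in Python of such punctuation-based splitting. It is a simplification of the idea in the paragraph above, not the embodiment's actual implementation; real text would need extra handling (for example, abbreviations containing periods).

```python
import re

def split_sentences(text: str, lang: str) -> list[str]:
    """Split document text into 'sentence' units at the sentence-final mark:
    the full stop '。' for Japanese or the period '.' for English."""
    mark = "。" if lang == "ja" else r"\."
    return [p.strip() for p in re.split(f"(?<={mark})", text) if p.strip()]

print(split_sentences("祭とは催事のことである。九州では秋に行われるものもある。", "ja"))
print(split_sentences("There is a button. Press it.", "en"))
```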
[0104] Next, the details of the morphological analysis and syntactic analysis of step S13, performed by the CPU 11 of the sentence unit search device 1 as shown in the flowchart of Fig. 3, are described.
[0105] The CPU 11 of the sentence unit search device 1 performs morphological analysis, based on dictionary information, on the linguistic unit of a 'sentence', identifying the morphemes that are the minimum structural units of the sentence and analyzing the morpheme structure. For example, for the document data shown in Fig. 4, the CPU 11 identifies morphemes by matching against character strings in the dictionary information of the storage means 13: nouns such as '祭 (festival)' and '神霊 (divine spirit)', proper nouns such as '九州 (Kyushu)', verbs such as '祀る (to enshrine)', particles such as 'と' and 'は', and symbols such as '、' and '。'. Various techniques of morphological analysis have been proposed to date, and the present invention does not limit which morphological analysis technique is used.
[0106] Furthermore, the CPU 11 of the sentence unit search device 1 performs syntactic analysis, extracting the grammatical relationships between morphemes on the basis of the part-of-speech information of each identified morpheme (noun, particle, adjective, verb, adverb, and so on) and grammatical information obtained by statistically modeling the cohesion between parts of speech under Japanese grammar for Japanese sentences, or English grammar for English sentences. For example, by fitting the grammar to a tree structure, the relationships between morphemes can be extracted from their part-of-speech information according to the tree. Suppose the analysis target is (adjective + noun + particle + noun). First it is determined whether the target is a noun. If it is determined not to be a noun, it is next determined whether the target matches (adjective + noun); that is, whether the leading morpheme of the target is an adjective phrase. If the leading morpheme is determined to be an adjective, that adjective is judged to be the largest modifier in the target, modifying the noun that follows; in other words, the relationship (adjective + (noun)) is extracted.
[0107] Next, it is determined whether the remaining target is a (noun). If it consists of multiple morphemes and is determined not to be a noun, it is determined whether the remaining target matches (adjective + noun); that is, whether its leading morpheme is an adjective. If the leading morpheme of the remaining target is determined not to be an adjective, the adjective slot of (adjective + noun) is expanded into (noun + particle), and it is determined whether the remaining target matches ((noun + particle) + noun). If it is determined that it does, the grammatical relationship between the morphemes of the overall target (adjective + noun + particle + noun) can be extracted as [adjective + {(noun + particle) + noun}]. The method of syntactic analysis is not limited to approaches built on this procedure; as with morphological analysis, various techniques have been proposed to date, and the present invention does not limit which syntactic analysis technique is used.
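The following is a minimal sketch, in Python, of this kind of top-down matching over part-of-speech sequences. The tiny grammar (noun phrases only) and the bracketing it prints are illustrative assumptions, far simpler than the parsers cited in the next paragraph.

```python
from typing import Optional

def parse_np(tags: list[str]) -> Optional[str]:
    """Try to analyze a part-of-speech sequence as a noun phrase, returning
    a bracketing such as '[adj + {(noun + prt) + noun}]', or None on failure."""
    if tags == ["noun"]:
        return "noun"
    if tags and tags[0] == "adj":                    # (adjective + rest)
        inner = parse_np(tags[1:])
        if inner is not None:
            return f"[adj + {{{inner}}}]"
    if len(tags) >= 3 and tags[0] == "noun" and tags[1] == "prt":
        inner = parse_np(tags[2:])                   # ((noun + particle) + rest)
        if inner is not None:
            return f"(noun + prt) + {inner}"
    return None

# (adjective + noun + particle + noun) -> [adj + {(noun + prt) + noun}]
print(parse_np(["adj", "noun", "prt", "noun"]))
```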
[0108] In Embodiment 1, as one example, the morphological analysis and syntactic analysis are performed on the basis of the techniques disclosed in ChaSen (http://chasen.org) and CaboCha (Taku Kudo and Yuji Matsumoto, 'Japanese Dependency Analysis using Cascaded Chunking', IPSJ Journal, Vol. 43, No. 6, pp. 1834-1842 (2002); see http://chasen.org/~taku/software/cabocha). Alternatively, the analysis may be based on the technique disclosed in KNP (Kurohashi-Nagao Parser) (Sadao Kurohashi and Makoto Nagao, 'A Syntactic Analysis Method of Long Japanese Sentences Based on the Detection of Conjunctive Structures', Journal of Natural Language Processing, Vol. 1, No. 1, pp. 35-57 (1994)).
[0109] The CPU 11 of the sentence unit search device 1 generates document data in which the analyzed morphemes and the grammatical relationships between them are represented by tags based on XML (eXtensible Markup Language), and stores it in the document storage means 2. The natural language analysis methods used by the present invention for morphological analysis and syntactic analysis (ChaSen, CaboCha) morphologically analyze an input character string and then syntactically analyze it, outputting, for each separated morpheme, its part-of-speech information, information indicating its dependency target, and so on. The control program 1P stored in the storage means 13 of the sentence unit search device 1 is configured so that the CPU 11 of the sentence unit search device 1 can execute these natural language analysis methods.
[0110] In the morphological analysis and syntactic analysis used by the present invention, for example, the character string of the sentence shown in Fig. 4, '九州地方北部では、秋に行われるものに対して(お)くんちと称する場合もある。' ('In the northern Kyushu region, those held in autumn are sometimes called (O)kunchi.'), is first assigned chunk numbers: (0: 九州地方北部では、 / 1: 秋に行われるものに対して(お)くんちと称する場合も / 2: ある。). Each chunk is then separated into morphemes, and part-of-speech information, base-form information, pronunciation information, and the like are attached to each morpheme. For the chunk with chunk number 0, the morphemes are identified and annotated as (0: 九州 (noun + proper noun + region + general; 九州; キユウシユウ) / 地方 (noun + general; 地方; チホウ) / 北部 (noun + general; 北部; ホクブ) / で (particle + case particle + general; で; デ) / は (particle + binding particle; は; ハ) / 、 (symbol + comma)). The morpheme '九州 (Kyushu)' is a noun and a proper noun; it is also a noun denoting a region and is sometimes used as a common noun. It can be determined that its base form is '九州' and that it is pronounced 'キユウシユウ'. The same applies to the other chunks. Dependency information is obtained in a form such as (0 2, 1 2, 2 -1), from which the dependency relationships between chunks can be determined: the chunk with chunk number 0 depends on the chunk with chunk number 2, and the chunk with chunk number 1 likewise depends on the chunk with chunk number 2. That the chunk with chunk number 2 has no dependency target is indicated by its target being -1.
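The following is a minimal sketch, in Python, of reading such (chunk, head) dependency pairs into a structure and recovering each chunk's dependency target. The pair format mirrors the (0 2, 1 2, 2 -1) notation above; the chunk texts are taken from the example sentence, and everything else is an illustrative assumption.

```python
chunks = ["九州地方北部では、", "秋に行われるものに対して(お)くんちと称する場合も", "ある。"]

# Dependency pairs in the (chunk, head) notation above; -1 means "no head".
dependencies = [(0, 2), (1, 2), (2, -1)]

heads = {chunk: head for chunk, head in dependencies}
for i, text in enumerate(chunks):
    head = heads[i]
    if head == -1:
        print(f"chunk {i} {text!r}: root (no dependency target)")
    else:
        print(f"chunk {i} {text!r}: depends on chunk {head} {chunks[head]!r}")
```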
[0111] Fig. 5 is an explanatory diagram showing an example of document data that the CPU 11 of the sentence unit search device 1 in Embodiment 1 stores in the document storage means 2 with the results of the morphological analysis and syntactic analysis attached. It corresponds to an example of the document data stored in the document storage means 2 as a result of executing the processing procedure shown in the flowchart of Fig. 3 on document data with the contents shown in Fig. 4.
[0112] As shown in Fig. 5, the CPU 11 of the sentence unit search device 1 has separated part of the document with the contents shown in Fig. 4 into morphemes such as proper nouns, nouns, particles, and verbs, and the grammatical relationships between the morphemes are represented by the nesting of tags. The example shown in Fig. 5 follows the tagging scheme specified by the rules proposed in GDA (Global Document Annotation; see http://i-content.org/gda). The present invention is not limited to following these rules; any method may be used, not necessarily XML tagging, as long as it allows a computer to identify, through information processing, the morpheme information and the dependency information between morphemes.
[0113] Tagging based on GDA is basically expressed as <tag-name attribute-name="attribute-value">. In the example shown in Fig. 5, the tag <su> represents a sentence (sentential unit). In the example of Fig. 5, the tags show that the sentence '九州地方北部では、秋に行われるものに対して(お)くんちと称する場合もある。' comprises the units of three chunks, '九州地方北部では', '秋に行われるものに対して(お)くんちと称する場合も', and 'ある', together with punctuation. The tag <ad> indicates particles other than sentence-final particles, adverbs, adnominal modifiers, and the like; it can also indicate that chunk 0, '九州地方北部では', plays an adverbial role as a whole. The tag <n> indicates a noun, and the tag <v> indicates a verb. Besides the tags shown in Fig. 5, there are tags such as <aj>, which indicates an adjective.
[0114] The attribute with attribute name syn indicates the dependency relationship between linguistic units, such as chunks or words, enclosed by the tag to which the attribute is attached. In a sentence where the attribute value f (forward) is attached, each linguistic unit of the sentence depends on the nearest following linguistic unit. Under this default, chunk 0, '九州地方北部では', would depend on chunk 1, '秋に行われるものに対して(お)くんちと称する場合も', and chunk 1 would depend on chunk 2, 'ある'.
[0115] By syntactic analysis, however, it has been determined that chunk 0, '九州地方北部では', depends on chunk 2, 'ある', and that chunk 1, '秋に行われるものに対して(お)くんちと称する場合も', also depends on chunk 2, 'ある'; the default above therefore does not apply. In this case the dependency relationships can be indicated by appending 'p' to each tag to mark a 'phrase', that is, a unit that is not the receiving side of a dependency. For example, the tag <adp> combines the tag <ad> with the 'p' indicating a phrase: a chunk enclosed in <adp> tags is an adverbial phrase and not a chunk that receives a dependency. In the example shown in Fig. 5, therefore, chunk 1, '秋に行われるものに対して(お)くんちと称する場合も', is an adverbial phrase and not a receiving chunk, so it is indicated that chunk 0, '九州地方北部では', depends on 'ある', skipping over chunk 1. In other cases as well, 'p' is appended to make explicit that a unit is a 'phrase'.
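Schematically, the chunk-level markup described in the two paragraphs above might look as follows. This is an illustrative reconstruction from the description, with the inner morpheme tags of chunk 1 and all mph attributes omitted; it is not the exact markup of Fig. 5.

```xml
<su>
  <adp><n>九州</n><n>地方</n><n>北部</n><ad>で</ad><ad>は</ad>、</adp>
  <adp><!-- inner morpheme tags omitted -->秋に行われるものに対して(お)くんちと称する場合も</adp>
  <v>ある</v>。
</su>
```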
[0116] Likewise, for the tag <n>, writing <np> indicates that the word is not the receiving side of a dependency. '九州地方北部' can be separated into the morphemes '九州', '地方', and '北部', each enclosed in <n>; since '九州' depends on '地方' and '地方' depends on '北部', no 'p' is needed. By contrast, in '催事(催し、イベント)、フェスティバルのこと', '催事(催し、イベント)' depends not on 'フェスティバル' but on 'の', so the dependency relationship can be indicated by making the tag enclosing 'フェスティバル' an <np>.
[0117] A proper noun denoting a place, such as '九州', and a proper noun denoting a person's name, such as '太郎 (Taro)', can be indicated by the tags <placename> and <pername>, respectively.
[0118] Morphemes that refer back to a preceding word or sentence, such as demonstrative pronouns and zero pronouns, can be represented using attributes expressing anaphoric relations. In GDA, the attribute name id can be used to express which word of a preceding word or sentence a demonstrative pronoun or zero pronoun refers to. For example, for the sentence '右側にボタンがあるので、それを押してください。' ('There is a button on the right side, so please press it.'), a human reader naturally resolves 'それ (it)' to 'ボタン (the button)'. When a computer processes the sentence, however, although matching against dictionary information can identify 'それ' as a demonstrative pronoun, it cannot determine what it refers to. In GDA, therefore, an id attribute is attached to the 'ボタン' that 'それ' indicates, and the attribute name eq, indicating an equality relation with the morpheme identified by that id attribute, expresses 'それ' = 'ボタン'. Specifically, writing '右側に<np id="Btn">ボタン</np>があるので、<np eq="Btn">それ</np>を押してください。' (other tags omitted) expresses the relation 'それ' = 'ボタン'.
[0119] For a zero pronoun, there is no pronoun itself to which an eq attribute could be attached. The object represented by the zero pronoun can therefore be indicated by attaching information that makes the object explicit to the verb '押し (press)', whose action takes 'それ' = 'ボタン' as its object. The attribute name obj, indicating the object of the action of the morpheme enclosed by the tag, can express that the object of the action '押し' is 'ボタン'. Specifically, for the sentence '右側にボタンがあるので、押してください。' ('There is a button on the right side, so please press (it).'), writing '右側に<np id="Btn">ボタン</np>があるので、<v obj="Btn">押し</v>てください。' makes the relation to the omitted object explicit.
[0120] Even when the referring word is distant from the word it refers to, the anaphoric relation can be indicated by the id, eq, and obj attributes described above. For example, by writing '右側に<np id="Btn">ボタン</np>があります。' ('There is a button on the right side.'), '<np eq="Btn">それ</np>にはXのマークがついています。' ('It is marked with an X.'), and '停止する際に<v obj="Btn">押し</v>てください。' ('Press it when stopping.'), it can be indicated that 'それ' in the second sentence refers to 'ボタン' and that the object of '押し' in the third sentence is 'ボタン'.
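As an illustration of how such markup can be consumed mechanically, the following is a minimal sketch in Python that resolves the eq and obj references of the three-sentence example above. The flat <su> structure is a simplification assumed for the sketch, not the full GDA markup.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<doc>"
    '<su>右側に<np id="Btn">ボタン</np>があります。</su>'
    '<su><np eq="Btn">それ</np>にはXのマークがついています。</su>'
    '<su>停止する際に<v obj="Btn">押し</v>てください。</su>'
    "</doc>"
)

# Collect the antecedents introduced with an id attribute.
antecedents = {el.get("id"): el.text for el in doc.iter() if el.get("id")}

# Resolve eq (identity) and obj (omitted object) references against them.
for el in doc.iter():
    for attr in ("eq", "obj"):
        ref = el.get(attr)
        if ref:
            print(f"{el.text!r} ({attr}) -> {antecedents[ref]!r}")
# prints: 'それ' (eq) -> 'ボタン'
#         '押し' (obj) -> 'ボタン'
```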
[0121] In the attribute information of the tags enclosing each morpheme, such as <n>, <ad>, and <v>, information indicating the result of the morpheme analysis is attached under the attribute name mph. The attribute value indicates the part-of-speech information, base-form information, pronunciation information, and the like of the morpheme obtained by the morphological analysis. Specifically, for the attribute name mph, the additional information, part-of-speech information, conjugation information, base-form information, and pronunciation information are given as the attribute value, in the form mph="additional-info;part-of-speech;conjugation;base-form;pronunciation". In the example shown in Fig. 5, the mph attribute makes explicit for '九州' that its part of speech is classified as noun + proper noun + region + general, that its base form is 九州, and that it is pronounced 'キユウシユウ'. In the present invention, since the morphological analysis and syntactic analysis are performed on the basis of the methods provided by ChaSen, the identification information chasen is attached as the additional information of each morpheme.
[0122] As described above, the CPU 11 of the sentence unit search device 1 tags the document data acquired by Web crawling with the results of the morphological analysis and syntactic analysis in accordance with the GDA rules, and stores the resulting XML data in the document storage means 2 via the document set connection means 16. By storing the document data as XML data, the CPU 11 of the sentence unit search device 1 can identify the tags of the document data by character string analysis and, by identifying the attribute information attached to the tags, can determine the information and grammatical relationships of each morpheme.
[0123] Further, when morphologically analyzing the document data acquired by Web crawling, the CPU 11 of the sentence unit search device 1 extracts the words appearing in all the acquired document data, assigns them identification numbers, and stores them as a list in the storage means 13. Fig. 6 is an explanatory diagram showing an example of the list of words extracted from all the document data acquired by the CPU 11 of the sentence unit search device 1 in Embodiment 1. In the example shown in the explanatory diagram of Fig. 6, 31,245 words are listed. Commonplace words such as 'こと (thing)' and 'もの (object)' are excluded from the stored words: like conjunctions and articles, they are too general, and although they appear frequently, the words themselves carry no meaning, so they would burden the search processing and are unsuitable as search targets.
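The following is a minimal sketch, in Python, of building such a word list with identification numbers while filtering out commonplace words. The stopword set and the input words are illustrative assumptions.

```python
# Hypothetical stopwords: commonplace words excluded from the list.
STOPWORDS = {"こと", "もの"}

word_ids: dict = {}

def register(words: list) -> None:
    """Assign a serial identification number to each new, non-stopword word."""
    for word in words:
        if word in STOPWORDS or word in word_ids:
            continue
        word_ids[word] = len(word_ids) + 1

register(["祭", "こと", "神霊", "九州", "祭", "もの", "フェスティバル"])
print(word_ids)   # {'祭': 1, '神霊': 2, '九州': 3, 'フェスティバル': 4}
```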
[0124] 3. Quantifying the coherent meaning of each sentence of document data
3-1. Defining the coherent meaning of each sentence
Next, the CPU 11 of the sentence unit search device 1 identifies, for each sentence in the document data stored in the document storage means 2, information that quantitatively represents the coherent meaning of that sentence. The information quantitatively representing the coherent meaning of a sentence is expressed by the group of words the user is attending to when using the sentence (uttering, writing, hearing, or reading it), together with a value quantitatively indicating the degree of attention to each word, that is, its salience (the word's weight value).
[0125] The salience of each word within a sentence could also be quantified by the appearance frequencies used by conventional search services. However, appearance frequency is computed over a document, or over the whole document set, as its base. Therefore, although calculating the appearance frequency of each word per document can quantitatively represent the coherent meaning of the document as a whole, it cannot represent a coherent meaning that reflects the context, which changes dynamically from sentence to sentence with the flow of the document.
[0126] Furthermore, the salience of a word within a sentence can be expressed by grammatically distinguishing, according to how the word is used, the transition between the word's prominence in the preceding sentence and its prominence in the current sentence. That is, when a word that was the topic (subject) in the preceding sentence is also the topic (subject) in the current sentence, that word receives the most attention in the current sentence and is highly salient. By contrast, a word that did not appear in the preceding sentence but is the topic (subject) in the current sentence, while attracting attention in the current sentence, is less salient than a word that continues to be used as the topic as described above. The formalization of this salience has been studied as centering theory (Grosz et al., 1995; Nariyama, 2002; Poesio et al., 2004).
[0127] In the formalization given by centering theory, the salience of each word is not expressed as a feature quantity that a computer or the like can calculate quantitatively; it can only be determined to which of the transition types defined by centering theory the transition of each word belongs. The present invention therefore calculates the salience of each word in each sentence quantitatively.
[0128] In Embodiment 1, a reference probability per sentence unit is calculated for each word, and the calculated reference probability is assigned as a weight value representing the salience of the word in that sentence unit.
[0129] This is because the more attention a word receives in a sentence, the higher the probability that it will continue to appear in, or be referred to from, subsequent sentences; the probability that the word appears in or is referred to from a subsequent sentence can therefore be taken as its reference probability and regarded as its salience. Moreover, the reference probability that a word appears or is referred to in a subsequent sentence is not characterized by the meaning of the word, which is difficult to handle quantitatively. Instead, a feature pattern, including the pattern in which the word appears or is referred to, that can be analyzed by the information processing of the sentence unit search apparatus 1 is specified, and the proportion of words appearing or referred to with the same feature pattern as the specified one that actually appear or are referred to in the subsequent sentence is calculated as the reference probability.
[0130] Hereinafter, the reference probability of each word is used as the word's weight value for a sentence unit, and the set of words in a sentence to which the respective weight values have been assigned is called a weighted word group. The semantic cohesion of each sentence unit can thus be represented by a weighted word group to which quantitative weight values, namely reference probabilities, are assigned.
[0131] 3-2. Regression model learning
The reference probability is obtained as the proportion, among all occurrences of the same feature pattern as the specified one, of cases in which the word actually appears or is referred to in the subsequent sentence. If feature patterns identical to the specified one occurred in large and roughly equal numbers for every feature pattern, the reference probability could be calculated without statistical problems. In practice, however, the number of occurrences of an identical feature pattern is limited, and an enormous amount of document data would be required to calculate reliable reference probabilities. Therefore, a regression equation for predicting, from the feature pattern of a word (the factor behind the event), whether the word will appear or be referred to in the subsequent sentence is obtained by regression model learning on the feature patterns and on the events of whether the word actually appeared or was referred to in the subsequent sentence.
[0132] The following description is divided into two stages: "3-2-1. Specifying feature patterns" for the feature patterns that serve as samples for regression model learning, and "3-2-2. Learning the regression equation" using those feature patterns.
[0133] 3-2-1. Specifying feature patterns
Each sentence in the document data stored in the document storage means 2 is enclosed by <su> tags, and the words appearing in the sentence, or the words in an anaphoric relation with a demonstrative pronoun or zero pronoun in the sentence, can be specified from the attribute information of the tags. The sentence unit search apparatus 1 of the present invention therefore specifies feature patterns for the document data stored in the document storage means 2 as follows.
[0134] A pair of one sentence s in the document data and a word w contained in a sentence preceding that sentence in the same document data is taken as a sample (s, w). The feature pattern f(s, w) for the sample is specified by feature quantities such as the following: the distance (number of sentences) between sentence s and the sentence, among those preceding s, in which word w most recently appeared or was referred to (dist); the particle to which word w attaches where it most recently appeared or was referred to in a sentence preceding s (gram); and the number of times word w appeared or was referred to in the sentences preceding s (chain). The feature quantities are not limited to these; for example, whether word w is a word indicating a recent topic, or whether word w is in the first person, may also be used.
[0135] Because the results of morphological analysis and syntactic analysis are described in the document data stored in the document storage means 2 by tags conforming to GDA, character string analysis of the document data makes it possible to separate and count the sentences delimited by <su> tags, to identify particles from the part-of-speech information indicated by the tags within each sentence, and to count the number of appearances of a word, including appearances referred to by demonstrative pronouns or zero pronouns. The CPU 11 of the sentence unit search apparatus 1 can therefore specify the feature quantities dist, gram, and chain for each sample by analyzing the GDA-conformant tags and their attribute values.
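As a rough illustration of how f(s, w) = (dist, gram, chain) could be computed once the GDA tags have been reduced to (word, particle) pairs per sentence, the following Python sketch follows the counting conventions of the worked examples below; it is an assumption of this description, not the apparatus's actual tag parser:

def feature_pattern(sentences, i, w):
    # sentences: list of sentences s_1..s_n (0-indexed), each a list of
    # (word, particle) pairs; i: index of the current sentence s_i.
    # Following the worked examples, dist is counted up to the sentence
    # s_{i+1} whose appearance of w is to be predicted.
    dist, gram, chain = None, None, 0
    for j in range(i, -1, -1):            # s_i, s_{i-1}, ..., s_1
        for word, particle in sentences[j]:
            if word == w:
                chain += 1
                if dist is None:          # most recent occurrence of w
                    dist = (i + 1) - j
                    gram = particle       # e.g. 「は」 or a noun connection
    return dist, gram, chain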
[0136] The processing procedure by which the CPU 11 of the sentence unit search apparatus 1 extracts samples from the tagged document data stored in the document storage means 2, obtains the feature quantities of the extracted samples to specify their feature patterns, and estimates by regression analysis the regression equation for calculating reference probabilities from the feature patterns of the extracted samples is described next. FIG. 7 is a flowchart showing this processing procedure in Embodiment 1. The processing shown in the flowchart of FIG. 7 corresponds to the processing of specifying a feature pattern for each separated sentence unit, and the processing of executing the regression learning for calculating reference probabilities based on the feature patterns and on the determination results of whether the specified words appeared or were referred to in the subsequent sentence units.
[0137] The CPU 11 of the sentence unit search apparatus 1 acquires tagged document data from the document storage means 2 via the document set connection means 16 (step S21). The CPU 11 identifies the <su> tags attached to the acquired document data by character string analysis and separates the data into sentences (step S22). Next, the CPU 11 identifies each tag within the <su> elements indicating sentences by character string analysis, and extracts samples by associating each sentence with the words appearing in it or referred to from it (step S23). For each extracted sample, the CPU 11 identifies the tags by character string analysis and specifies the feature pattern consisting of dist, gram, and chain (step S24).

[0138] The CPU 11 determines whether the separated sentence is the end of the acquired document data (step S25). If the CPU 11 determines that the separated sentence is not the end of the document data (S25: NO), the CPU 11 returns the processing to step S22 and continues the separation processing by identifying the <su> tag of the following sentence. Whether the separated sentence is the end of the acquired document data can be determined, for example, by judging whether another <su> tag follows the <su></su> pair enclosing the currently separated sentence; if no tag follows, the sentence can be determined to be the end.
[0139] On the other hand, if the CPU 11 determines that the end of the document data has been reached (S25: YES), the CPU 11 determines whether extraction of a predetermined number of samples has been completed (step S26). If the CPU 11 determines that sample extraction has not been completed (S26: NO), the CPU 11 returns the processing to step S21, acquires different tagged document data, and continues the sample extraction.
[0140] If the CPU 11 determines that sample extraction has been completed (S26: YES), the CPU 11 performs regression analysis on the extracted samples, estimates the regression coefficients of the regression equation for the feature quantities dist, gram, and chain (step S27), and ends the processing.
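For illustration, step S27 could be sketched with an off-the-shelf logistic regression, assuming each sample has already been reduced to numeric feature values (scikit-learn is named here only as an example; the patent does not prescribe a library):

from sklearn.linear_model import LogisticRegression

def learn_regression(samples):
    # samples: list of ((dist, gram_value, chain), appeared) pairs, where
    # appeared is 1 if the word actually appeared or was referred to in
    # the subsequent sentence, else 0, and gram_value is the numeric
    # substitution value for the particle (see paragraph [0175] below).
    X = [list(features) for features, appeared in samples]
    y = [appeared for features, appeared in samples]
    model = LogisticRegression().fit(X, y)
    # Note: scikit-learn models Pr = 1/(1+exp(-(b0 + b.x))), i.e. the
    # negation of the exponent in equation (1) below.
    return model.intercept_[0], model.coef_[0]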
[0141] Next, the details of the above processing by the CPU 11 of the sentence unit search apparatus 1 are described with a concrete example.
[0142] FIG. 8 is an explanatory diagram showing an example of a feature pattern specified for a sentence in the document data stored in the document storage means 2 in Embodiment 1. For the sentence s_i shown in FIG. 8, the feature pattern f(s_i, 太郎君) of the sample (s_i, 太郎君) formed by sentence s_i and the word 「太郎君」 ("Taro-kun") contained in a preceding sentence is specified as follows. The distance feature quantity (dist) between the current sentence s_i and the sentence s_{i-1}, the most recent of the preceding sentences in which the word 「太郎君」 appeared or was referred to, is dist = 2, because the number of sentences from s_{i-1} up to the sentence s_{i+1} immediately following s_i is 2. Since the particle to which the word 「太郎君」 (referred to by 「彼」, "he") attaches in s_{i-1}, where it was most recently referred to, is 「は」, gram = ハ. Further, since the word 「太郎君」 appeared or was referred to in the sentences s_{i-2} and s_{i-1} preceding sentence s_i, chain = 2. The feature pattern is therefore specified as f(s_i, 太郎君) = (dist = 2, gram = ハ, chain = 2). In the case of English, gram would be specified by a preposition.

[0143] As described above, samples (s, w) are extracted from the sentences in the document data, and the feature pattern f(s, w) is specified for every extracted sample.
[0144] 3-2-2. Learning the regression equation
Next, the regression analysis of step S27 shown in the flowchart of FIG. 7 is described in detail.
[0145] In Embodiment 1, the regression analysis is performed based on a logistic regression model. The regression analysis is not limited to this; other regression analysis techniques, such as kNN (k-Nearest Neighbors) smoothing combined with a Support Vector Regression (SVR) model, may also be used.
[0146] When the kNN smoothing + SVR model is used, the regression model can be learned using the following eight elements as the feature quantities of the feature patterns that can be handled. The eight elements are the aforementioned dist, gram, and chain, plus the following five. One is the type of noun used when word w is referred to within the preceding sentence units (exp, pronoun: 1 / non-pronoun: 0). Another is whether word w is the topic when it appears or is referred to in the preceding sentence unit (last_topic, yes: 1 / no: 0). Another is whether word w is the subject when it appears or is referred to in the preceding sentence unit (last_sbj, yes: 1 / no: 0). Another is whether word w is in the first person in the sample (s, w) (p1, yes: 1 / no: 0). Another is the part-of-speech information of word w in the most recent preceding sentence unit in which it appeared or was referred to (pos, noun: 1, verb: 2, etc.). Yet another is whether word w is referred to in the title or a heading of the document (in_header, yes: 1 / no: 0). Furthermore, when regression analysis is performed based on speech data, any one or more of the following can be used among the eight elements: the number of seconds since the utterance time of the most recent reference to the word (time_dist), the speech rate per syllable of the phrase containing the most recent reference to the word, as a ratio to the speaker's average (syllable_speed), and the frequency ratio between the lowest and highest utterance pitches of the phrase containing the most recent reference to the word (pitch_fluct). By performing regression analysis on the feature quantities of speech data as well, reference probabilities can also be calculated from those feature quantities when the CPU 11 of the sentence unit search apparatus 1 receives speech data as word data, as described later.

[0147] Thus, when the kNN smoothing + SVR model is used, reference probabilities can be calculated based on more detailed feature quantities, and more precise reference probabilities can be obtained.
[0148] In Embodiment 1, whether word w actually appeared or was referred to in the sentence s_{i+1} following sentence s_i is taken as the explained variable, the dist, gram, and chain of the feature pattern specified for the sample (s_i, w) are taken as the feature quantities, and regression analysis is performed on all samples (s, w) using the logistic regression model. This yields a regression equation for calculating the probability Pr(s_{i+1}, w) that word w appears or is referred to in s_{i+1} when the feature quantities dist, gram, and chain are given.
[0149] The probability obtained by the logistic regression model is generally given, for explanatory variables (feature quantities) x_1, x_2, ..., x_n, by the following equation (1).
[0150] [数1]

Pr = 1 / (1 + exp(b_0 + b_1 x_1 + b_2 x_2 + ... + b_n x_n))    ...(1)
[0151] The parameters (regression coefficients) b_0, b_1, ..., b_n of equation (1) are estimated from the training samples by the maximum likelihood method. The regression analysis of the reference probability of word w in sentence s calculated in the present invention means the following: the explained variable is set to 0 for samples in which w neither appears nor is referred to in the subsequent sentence s_{i+1} and to 1 for samples in which it does appear or is referred to, the explanatory variables are the feature quantities dist, gram, and chain, and the extracted samples are learned to estimate the parameters (regression coefficients) b_0, b_1, b_2, b_3 of the following equation (2).
[0152] [数2]

Pr = 1 / (1 + exp(b_0 + b_1 dist + b_2 gram + b_3 chain))    ...(2)
[0153] The parameters (regression coefficients) learned from the extracted samples are estimated as, for example, b_0 = -1.425, b_1 = -0.564, b_2 = 11.036, and b_3 = 3.115 (regression analysis over 10000 samples). In this case, equation (3), obtained by substituting these parameters, is the regression equation for obtaining the reference probability.
[0154] [数3]

Pr = 1 / (1 + exp(-1.425 - 0.564 dist + 11.036 gram + 3.115 chain))    ...(3)
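Written out as an executable sketch with the example coefficients of paragraph [0153] (gram is the numeric substitution value described in paragraph [0175] below), equation (3) is:

import math

def reference_probability(dist, gram, chain,
                          b0=-1.425, b1=-0.564, b2=11.036, b3=3.115):
    # Equation (3): Pr = 1 / (1 + exp(b0 + b1*dist + b2*gram + b3*chain))
    return 1.0 / (1.0 + math.exp(b0 + b1 * dist + b2 * gram + b3 * chain))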
[0155] The values of the estimated parameters (regression coefficients) b_0, b_1, b_2, b_3 differ depending on the document data stored in the document storage means 2. For example, the estimated parameters differ between the case where the document data stored in the document storage means 2 consists only of newspaper articles, which are written language, and the case where it consists only of utterances, which are spoken language, converted into document data. Even for document data consisting only of newspaper articles of the same kind, the estimated parameter values b_0, b_1, b_2, b_3 differ depending on the amount of document data and the content of the documents. In the present invention, therefore, for regression analysis of spoken language, document data is stored separately for written and spoken language, parameters are also estimated by regression analysis for the document data consisting of spoken language, and the regression equation for calculating reference probabilities is stored. If the words accepted by the accepting apparatuses 4, 4, ... are limited to written-language sentences entered by character input rather than utterances entered by speech, the document data may be stored in the document storage means 2 without distinguishing between spoken and written language.
[0156] Through the above regression analysis, the parameters of the regression equation (3) for the feature quantities dist, gram, and chain are obtained. Therefore, by specifying the feature pattern consisting of the feature quantities dist, gram, and chain of each word in a sentence unit, the CPU 11 of the sentence unit search apparatus 1 can calculate the reference probability of a word having that feature pattern.
[0157] 3-3. Quantifying the salience of each sentence unit
Since the regression equation has been obtained by regression analysis, the CPU 11 of the sentence unit search apparatus 1 can calculate the reference probability of each word extracted per sentence unit by specifying its feature quantities dist, gram, and chain. The CPU 11 of the sentence unit search apparatus 1 therefore acquires the tagged document data stored in the document storage means 2, separates it into sentences, and, for the words appearing in or referred to from each sentence, specifies feature patterns and calculates reference probabilities. This makes it possible to quantitatively represent, for each sentence, a semantic cohesion in which the contextual meaning of the preceding sentences is reflected.
[0158] The processing by which the CPU 11 of the sentence unit search apparatus 1, after the regression analysis, calculates the words and the reference probability of each word (the weighted word group) for each sentence of the document data stored in the document storage means 2 is described below.
[0159] The CPU 11 of the sentence unit search apparatus 1 acquires the document data stored in the document storage means 2 and, for each sentence contained in the document data, specifies the grammatical feature pattern of each word over that sentence and the preceding sentences, calculates the reference probability of each word for each sentence based on the specified feature patterns and the regression equation, and stores the results in advance.
[0160] The CPU 11 of the sentence unit search apparatus 1 stores the set of each word and its reference probability (the weighted word group) in association with each sentence unit. That is, the CPU 11 performs this storage processing for the full text of all documents acquired from the document set. In the later search processing, on the other hand, the CPU 11 extracts from the full text of all documents the sentences whose contextual meaning is similar to the accepted words. In that case, reading out the full text of all documents one sentence at a time and reading out the weighted word group representing the contextual meaning associated with each sentence would impose a large processing load.
[0161] Therefore, in order to make it possible in later processing to extract the weighted word groups, each representing the contextual meaning of the preceding sentences for a sentence, without reading out the full text of all documents one by one, the CPU 11 of the sentence unit search apparatus 1 performs processing to build the weighted word groups calculated for each sentence into a database and index them.
[0162] FIG. 9 and FIG. 10 are flowcharts showing the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 in Embodiment 1 calculates and stores the reference probabilities of words for each sentence of the tagged document data stored in the document storage means 2. The processing shown in the flowcharts of FIG. 9 and FIG. 10 corresponds to the processing of calculating, for each sentence unit, the reference probabilities using the feature pattern specified for each word and the regression coefficients corresponding to the feature patterns, and to the processing of storing the calculated reference probabilities in advance as pairs with the words.
[0163] The CPU 11 of the sentence unit search apparatus 1 acquires tagged document data from the document storage means 2 via the document set connection means 16 (step S301). The CPU 11 identifies the <su> tags attached to the acquired document data by character string analysis and separates the data into sentences (step S302). Next, the CPU 11 identifies each tag within the <su> elements indicating sentences by character string analysis, extracts the words appearing in or referred to from each sentence (step S303), and stores the extracted words in the temporary storage area 14 while the reference probabilities for that document data are being calculated (step S304).
[0164] For each word of the document data containing the sentence, stored in the temporary storage area 14, the CPU 11 identifies the tags attached to the word by character string analysis and specifies the feature pattern consisting of dist, gram, and chain (step S305). Next, the CPU 11 substitutes the feature quantities of the specified feature pattern into equation (3) and calculates the reference probability (step S306).
[0165] The CPU 11 determines whether the reference probability of each word for the sentence has been calculated for all the words stored in the temporary storage area 14 (step S307). If the CPU 11 determines that reference probabilities have not been calculated for all the words (S307: NO), the CPU 11 returns the processing to step S305 and continues specifying feature patterns and calculating reference probabilities for the remaining words. On the other hand, if the CPU 11 determines that reference probabilities have been calculated for all the words (S307: YES), the CPU 11 stores the set of the words stored in the temporary storage area 14 and the reference probability calculated for each word (the weighted word group) by adding a salience attribute (step S308). At this point, the CPU 11 narrows down the words by a predetermined reference probability value and does not store words whose reference probability is below the predetermined value.
[0166] Next, the CPU 11 indexes the set of words and per-word reference probabilities (the weighted word group) attached to the current sentence and stores it in the weighted word group database so that it can be extracted later (step S309). The CPU 11 may store the database in the storage means 13, or in the document storage means 2 via the document set connection means 16. As one part of the indexing processing, the CPU 11 executes processing such as the following.
[0167] For example, the CPU 11 focuses on the reference probability of one word within the weighted word group obtained in step S308 and determines whether that word's reference probability is at least a predetermined value. Next, the CPU 11 determines whether the reference probability of another word within the weighted word group is at least a predetermined value. The CPU 11 thus determines, for each calculated weighted word group, whether it belongs to the group in which the first word's reference probability is at least the predetermined value or to the group in which it is below the predetermined value; if it belongs to the former, the CPU 11 further determines whether it belongs to the group in which another word's reference probability is at least a predetermined value or to the group in which it is below that value. By repeating such processing, the CPU 11 determines to which group each calculated weighted word group belongs and stores it in association with the identification information of that group. A k-d tree search algorithm, for example, can be applied to this indexing processing.
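As one hedged illustration of this indexing (the splitting words and the threshold are assumptions of this sketch, not values fixed by Embodiment 1), the repeated threshold tests can be seen as routing each weighted word group down a binary, k-d-tree-like structure to a node ID:

def kd_node_id(weighted_words, split_words, threshold=0.1):
    # weighted_words: dict word_id -> reference probability (weight value)
    # split_words: the word_ids used as the splitting dimension per level
    node_id = 0
    for word_id in split_words:
        bit = 1 if weighted_words.get(word_id, 0.0) >= threshold else 0
        node_id = node_id * 2 + bit    # descend left/right at each level
    return node_id

Weighted word groups that arrive at the same node ID then have similar weight-value distributions over the splitting words, so a later search only needs to examine the groups stored at or near the node reached by the query's weighted word group.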
[0168] The CPU 11 determines whether the processing of associating a weighted word group with each sentence has been completed for all sentences in the document data acquired in step S301 (step S310). The CPU 11 determines this as follows: for example, it judges whether another <su> tag follows the <su></su> pair enclosing the current sentence, and if no tag follows, it can determine that the end has been reached. If the CPU 11 determines that the processing of associating a weighted word group with each sentence has not been completed for all sentences in the document data acquired in step S301 (S310: NO), the CPU 11 returns the processing to step S302 and continues the processing for the next sentence. On the other hand, if the CPU 11 determines that the processing has been completed for all sentences in the document data acquired in step S301 (S310: YES), the CPU 11 deletes the words extracted from the document data and stored in the temporary storage area 14 (step S311).
[0169] The CPU 11 determines whether the processing of storing the words and the word reference probabilities by the salience attribute has been completed for all document data (step S312). If the CPU 11 determines that this processing has not been completed for all document data (S312: NO), the CPU 11 returns the processing to step S301, acquires other document data, and continues the processing. If the CPU 11 determines that the processing of storing the words and the word reference probabilities by the salience attribute has been completed for all document data (S312: YES), the CPU 11 ends the processing of calculating and storing word reference probabilities in advance.
[0170] Next, the case where the CPU 11 of the sentence unit search apparatus 1 performs the processing shown in the flowcharts of FIG. 9 and FIG. 10 on the document data shown in FIG. 5 is described concretely.
[0171] FIG. 11 is an explanatory diagram showing an example in which the CPU 11 of the sentence unit search apparatus 1 in Embodiment 1 has separated the document represented by the document data into sentences.
[0172] Through the processing of steps S301 and S302, the CPU 11 of the sentence unit search apparatus 1 identifies the <su> tags in the document data stored in the document storage means 2 and separates it into sentences. In the example shown in FIG. 11, the text is separated into the sentences s_1 「祭とは、神霊などを祀る儀式。」 ("A festival is a ritual that enshrines divine spirits and the like."), s_2 「祭礼、祭祀とも呼ばれる。」 ("It is also called a religious festival or a rite."), and s_3 「九州地方北部では、秋に行われるものに対して(お)くんちと称する場合もある。」 ("In the northern Kyushu region, those held in autumn are sometimes called (O)kunchi."). The words extracted from sentences s_1, s_2, and s_3 through the processing of step S303 by the CPU 11 of the sentence unit search apparatus 1, matching the words stored in the word list, are 「祭」 (festival), 「神霊」 (divine spirit), 「儀式」 (ritual), 「祭礼」 (religious festival), 「祭祀」 (rite), 「九州」 (Kyushu), 「九州地方」 (Kyushu region), 「九州地方北部」 (northern Kyushu region), 「秋」 (autumn), 「くんち」 (Kunchi), and 「場合」 (case) (see FIG. 6).
[0173] Through the processing of step S305, in order to quantitatively obtain the salience (reference probability) of each word in sentence s_3, the CPU 11 of the sentence unit search apparatus 1 specifies the feature pattern consisting of the feature quantities dist, gram, and chain for each word. For example, the feature pattern of 「九州」 (identification number: 9714; see FIG. 6) in sentence s_3 is specified as follows.
[0174] As shown in the explanatory diagram of FIG. 11, the dist of 「九州」 in sentence s_3 is dist = 1, from the distance of 1 between sentence s_3, in which 「九州」 most recently appeared, and the following sentence s_4. As for the gram of 「九州」 in sentence s_3: in sentence s_3, where 「九州」 most recently appeared, 「九州」 attaches not to a particle but to 「地方」 ("region"), so it can be specified as a noun connection, and gram = noun connection. As for the chain of 「九州」 in sentence s_3: 「九州」 appeared once from s_1 to s_3, so chain = 1. The feature pattern is therefore specified as f(s_3, 九州) = (dist = 1, gram = noun connection, chain = 1). The CPU 11 of the sentence unit search apparatus 1 then calculates the reference probability by substituting the values of the feature quantities dist, gram, and chain into equation (3) through the processing of step S306 in the flowcharts of FIG. 9 and FIG. 10.
[0175] Here, the substitution values for the feature quantity represented by gram are obtained by extracting samples (s, w) from the document data stored in the document storage means 2, calculating the reference probability of word w for each, and taking, for each gram, the average of those reference probabilities as the substitution value. For example, among the extracted samples (s, w), the average of the reference probabilities calculated for the words attached to gram = ハ is the value substituted when the feature quantity gram is 「ハ」. In Embodiment 1, as an example, the following values are calculated: gram = 0.0540 when gram = ハ, gram = 0.0288 when gram = ガ, gram = 0.0198 when gram = ノ, gram = 0.0179 when gram = ヲ, gram = 0.0124 when gram = ニ, and gram = 0.00352 when gram = noun connection.
[0176] It should be noted that the average reference probability that a word appears in the subsequent sentence, for the cases where the word attaches to the particle 「ハ」, the particle 「ガ」, the particle 「ノ」, or the particle 「ヲ」, decreases in the order 「ハ」 (topic) > 「ガ」 (subject) > 「ノ」 > 「ヲ」 (object). This is roughly consistent with the ranking topic > subject > object > ... that centering theory formalizes as an indicator of whether a word is the center of the sentence.
[0177] The reference probability of 「九州」 in sentence s_3 (the probability that 「九州」 appears or is referred to in sentence s_4) is calculated from the specified feature quantities by the following equation (4).
[0178] [数4]

Pr = 1 / (1 + exp(-1.425 - 0.564 x 1 + 11.036 x 0.00352 + 3.115 x 1)) = 0.238
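The arithmetic of equation (4) can be checked directly (a trivial sketch):

import math

pr = 1.0 / (1.0 + math.exp(-1.425 - 0.564 * 1
                           + 11.036 * 0.00352 + 3.115 * 1))
print(round(pr, 3))   # 0.238: the reference probability of 「九州」 in s_3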
[0179] As shown in equation (4), the reference probability of 「九州」 in sentence s_3 is calculated as 0.238. The calculated reference probability is stored for sentence s_3. The CPU 11 of the sentence unit search apparatus 1 represents each word for sentence s_3 by the identification number under which it is stored in the list, and stores it in association with its reference probability. In the present invention, an attribute named salience is defined on the <su> tag that delimits sentence units, and its attribute value is defined as an enumeration of pairs of a word identification number and a reference probability; the words and their reference probabilities (the weighted word group) are stored for each sentence as follows.
[0180] <su salience="identification number of word 1: reference probability of word 1 identification number of word 2: reference probability of word 2 identification number of word 3: reference probability of word 3 ...">~</su>
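For example, with the values described for FIG. 12 below, sentence s_3 would be stored along the following lines (a hypothetical instance assembled for illustration; the remaining entries are elided):

<su salience="9714:0.238 9716:0.1159 ...">九州地方北部では、秋に行われるものに対して(お)くんちと称する場合もある。</su>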
[0181] FIG. 12 is an explanatory diagram showing an example of document data to which the CPU 11 of the sentence unit search apparatus 1 in Embodiment 1 has assigned the calculated reference probabilities and which is stored in the document storage means 2. For sentence s_3, the reference probability of 「九州」 (9714) (its weight value in sentence s_3; the same applies below) is stored as 0.238, the reference probability of 「九州地方北部」 (9716) as 0.1159, and so on; for the following sentence s_4, the reference probability of 「九州」 (9714) is stored as 0.238, the reference probability of 「祭」 (22953) as 0.1836, and so on. A different set of words and reference probabilities (weighted word group) is stored for each sentence and can be used in searching as information representing the semantic cohesion of each sentence. In sentences s_3 and s_4, the same reference probability is calculated for 「九州」 (9714); however, if the description continues through sentences s_5, s_6, ... with descriptions of festivals not limited to the Kyushu region, the reference probability of 「九州」 is expected to decline gradually.
[0182] FIG. 13 is an explanatory diagram showing an example of the contents of the database when the CPU 11 of the sentence unit search apparatus 1 in Embodiment 1 indexes and stores the weighted word groups calculated for each sentence unit. The content example of FIG. 13 corresponds to the weighted word group associated with the sentence s_4 shown in the content example of FIG. 12, as indexed by step S309 of the CPU 11 shown in the flowcharts of FIG. 9 and FIG. 10.
[0183] As shown in FIG. 13, the CPU 11 stores each weighted word group in association with information indicating the group to which it belongs (a k-d tree node ID). In doing so, the CPU 11 also stores the file name of the tagged document data and the position within the document data (tag information) so that it can be specified with which sentence unit of which document data the weighted word group is associated. This makes it easy to extract, in later processing, the sentence units associated with weighted word groups similar to the weighted word group obtained for the accepted words.
[0184] FIG. 14 is an explanatory diagram showing how the sets of words stored for each sentence by the CPU 11 of the sentence unit search apparatus 1 and the reference probabilities calculated for those words change as the sentences continue. In FIG. 14, it can be seen that, as the text proceeds through sentences s_1, s_2, s_3, and s_4, the highly salient words differ from sentence to sentence in response to the context changing dynamically over time.
[0185] 4. Search processing
4-1. Accepting words input by the user
Next, the search processing in Embodiment 1 is described. The search processing starts when the accepting apparatuses 4, 4, ... accept words, such as keywords or speech, input by the user.
[0186] The CPU 41 of the accepting apparatus 4 is capable of detecting a character string input by the user via the operation means 45 and storing it in the temporary storage area 44, or of detecting speech input by the user via the speech input/output means 47, converting it into a character string, and storing it in the temporary storage area 44. The CPU 41 of the accepting apparatus 4 also has a function of analyzing the character string input by the user and separating it into individual sentences. For example, it may identify and separate on predetermined characters such as the full stop 「。」 in Japanese or the period "." in English. Alternatively, each time a press of the Enter key is detected via the operation means 45, the character string entered up to that point may be separated as one sentence. For speech input from the user, for example, the speech may be converted into a character string by a speech recognition function and separated into sentences by character string analysis of the converted string, or it may be separated into sentences where silence is detected. The CPU 41 of the accepting apparatus 4 transmits each separated sentence as text data to the sentence unit search apparatus 1 via the communication means 48.
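For illustration, the sentence separation on the delimiters mentioned above could be sketched as follows (an assumption of this description; the actual accepting apparatus 4 may separate differently):

import re

def split_sentences(text):
    # Split after the Japanese full stop 「。」 or an English period "."
    parts = re.split(r'(?<=[。.])\s*', text)
    return [p for p in parts if p]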
[0187] 4-2. Quantifying the semantic cohesion of accepted words
Next, the processing by which the CPU 11 of the sentence unit search apparatus 1, upon receiving text data indicating the words accepted by the accepting apparatuses 4, 4, ..., searches the sentences in the documents stored in the document storage means 2 is described. The semantic cohesion of the text data indicating the accepted words is also quantified; that is, words are extracted from the text data and their reference probabilities are calculated. In this way, information representing a semantic cohesion that reflects the context flowing from the preceding words in the user's latent awareness when inputting words can be created automatically as the search request of the search processing described later.
[0188] When the CPU 11 of the sentence unit search apparatus 1 receives, from the accepting apparatuses 4, 4, ... via the packet switching network 3 and the communication means 15, text data indicating the words accepted from the user, it stores the text data in the temporary storage area 14 in the order of reception and performs morphological analysis and syntactic analysis on the sentences indicated by the received text data. In addition, for each pair (s, w) of a sentence s indicated by the received text data and a word w that appeared in sentences indicated by text data received before sentence s, the CPU 11 specifies the feature pattern f(s, w) represented by the feature quantities dist, gram, and chain.

[0189] When the CPU 11 of the sentence unit search apparatus 1 has specified the feature pattern f(s, w) of word w in sentence s of the received text data, it calculates the reference probability based on the specified feature pattern and the regression equation obtained earlier. The CPU 11 of the sentence unit search apparatus 1 calculates the reference probability for each word and, using each word and the reference probability calculated for it, performs processing to compare them with the weighted word groups already stored in association with the sentence units, that is, the sets of words and per-word reference probabilities, thereby searching for sentence units.
[0190] Note that the CPU 11 of the sentence unit search apparatus 1 can receive from the accepting apparatuses 4, 4, ... not only text data but also speech data of utterances input by the user. In this case, the same processing is performed by specifying the grammatical feature patterns of the words represented in the speech data, as with text data. In the case of speech data, features obtainable from the speech data can also be treated as feature quantities for determining whether a word is highly salient. For example, when a word appears or is referred to, the CPU 11 can treat the time elapsed since it appeared or was referred to in the preceding words as one feature quantity. The CPU 11 can also treat the speech rate and/or the speech frequency when the word was uttered in the most recent preceding words in which it appeared or was referred to as further feature quantities. These are time information and information quantitatively representing the emotion put into the words, neither of which can be detected after conversion into text data.
[0191] The processing procedure by which the accepting apparatus 4 accepts words input by the user and transmits them to the sentence unit search apparatus 1, and by which the CPU 11 of the sentence unit search apparatus 1 searches the document data stored in the document storage means 2 based on the text data received from the accepting apparatus 4, is described with flowcharts. FIG. 15, FIG. 16, and FIG. 17 are flowcharts showing the processing procedure of the search processing of the sentence unit search apparatus 1 and the accepting apparatus 4 in Embodiment 1.
[0192] The CPU 41 of the accepting apparatus 4 determines whether it has detected a character string input operation by the user via the operation means 45 or speech input by the user via the speech input/output means 47 (step S401). If the CPU 41 determines that it has detected neither a character string input operation nor speech input by the user (S401: NO), the CPU 41 returns the processing to step S401 and waits until it detects a character string input operation or speech input by the user.

[0193] On the other hand, if the CPU 41 of the accepting apparatus 4 determines that it has detected a character string input operation or speech input by the user (S401: YES), the CPU 41 of the accepting apparatus 4 separates the input words into a sentence from the input character string or from the character string converted from the speech input, stores it in the temporary storage area 44 (step S402), and transmits the words input by the user to the sentence unit search apparatus 1 via the packet switching network 3 (step S403).
[0194] The CPU 11 of the sentence unit search apparatus 1 receives the words input by the user from the accepting apparatus 4 (step S404), and the CPU 11 stores the received words as sentences in the temporary storage area 14 as text data, in the order of reception (step S405). At this point, a sentence identification number may be added to each item of text data when storing it.
[0195] The CPU 11 performs morphological analysis and syntactic analysis on the stored text data (step S406), and stores the words extracted by the analysis in the temporary storage area 14 (step S407). In doing so, the CPU 11 matches each word against the words stored in the list and stores the word under its identification number in the list.
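By way of illustration only (the specification discloses no source code), the following Python sketch shows one way the word extraction of steps S406 and S407 could look. The whitespace tokenizer and the sample vocabulary entries are assumptions standing in for the morphological/syntactic analyzer and the word list of FIG. 6.

```python
# Illustrative sketch of steps S406-S407: analyze received text and store
# each extracted word under its identification number in the word list.
# A whitespace tokenizer stands in for the morphological analyzer.

def extract_word_ids(text: str, vocabulary: dict[str, int]) -> list[int]:
    """Return the list IDs of known words appearing in the text."""
    ids = []
    for token in text.split():          # stand-in for morphological analysis
        if token in vocabulary:         # match against the stored word list
            ids.append(vocabulary[token])
    return ids

# Hypothetical list entries; the real list of FIG. 6 holds 31245 words.
vocabulary = {"Okunchi": 20105, "Kyushu": 9714}
print(extract_word_ids("Okunchi is held in Kyushu", vocabulary))  # [20105, 9714]
```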
[0196] Through the processing in step S407 of the sentence unit search device 1, the temporary storage area 14 thus comes to hold every word that has appeared or been referenced at least once in the series of input words (utterances). The word extraction in step S407 is not strictly necessary; if it is omitted, the feature pattern identification described below is performed for all the words stored in the list.
[0197] For each word stored in the temporary storage area 14, the CPU 11 identifies a feature pattern on the basis of the text data received and stored so far and of the results of the morphological and syntactic analysis of step S406 (step S408). The CPU 11 substitutes the feature quantities of the identified feature pattern into a regression equation for calculating reference probabilities, obtained in advance by regression analysis of spoken language, and calculates a reference probability for each word (step S409). The CPU 11 then judges whether reference probabilities have been calculated for all the words stored in the temporary storage area 14 (step S410). When the CPU 11 judges that reference probabilities have not yet been calculated for all the stored words (S410: NO), it returns the processing to step S408 and performs the feature pattern identification and reference probability calculation for another word.
[0198] When the CPU 11 judges that reference probabilities have been calculated for all the stored words (S410: YES), it narrows the words, for each of which a reference probability has been calculated and stored in the temporary storage area 14, down to those whose calculated reference probability is at least a predetermined value (step S411). Removing words with extremely low reference probabilities reduces the load that the subsequent computation places on the CPU 11 itself. The CPU 11 performs the following search processing on the basis of the narrowed-down words and their reference probabilities for the accepted words.
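As a hedged illustration of steps S408 to S411, the sketch below turns each word's feature pattern (dist, gram, chain) into a reference probability and keeps only words above a threshold. The logistic link and every coefficient value are assumptions; the specification states only that a regression equation fitted in advance to spoken-language samples is used.

```python
import math

# Illustrative sketch of steps S408-S411: substitute each word's feature
# quantities into a pre-trained regression equation to obtain its reference
# probability, then keep only sufficiently probable words (step S411).

COEF = {"bias": -1.0, "dist": -0.8, "chain": 1.2, "gram=tte": 0.5}  # hypothetical

def reference_probability(dist: int, gram: str, chain: int) -> float:
    z = (COEF["bias"]
         + COEF["dist"] * dist
         + COEF["chain"] * chain
         + COEF.get(f"gram={gram}", 0.0))   # categorical feature as indicator
    return 1.0 / (1.0 + math.exp(-z))       # assumed logistic form

def weighted_word_group(features: dict[str, tuple], threshold: float = 0.05):
    """Word -> reference probability, dropping improbable words."""
    group = {w: reference_probability(*f) for w, f in features.items()}
    return {w: p for w, p in group.items() if p >= threshold}

# Feature pattern from paragraph [0211]: f = (dist, gram, chain)
print(weighted_word_group({"Okunchi": (3, "tte", 1)}))  # {'Okunchi': 0.15...}
```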
[0199] By the processing so far, a set of words paired with their reference probabilities (a weighted word group) has been generated as a search request for the accepted words; it quantitatively represents the semantic cohesion that the accepted words carry in the flow continuing from the previously accepted words. The search processing below (steps S412 to S416, enclosed by the dash-dotted line) is one example of processing that compares the weighted word group obtained for the accepted words with the weighted word groups stored in advance for each sentence, judges whether the words and a sentence are similar in meaning according to whether the distributions of the weight values of the words in the respective weighted word groups are similar, and extracts the similar sentences.
[0200] The CPU 11 reads, from the database in the storage means 13 or in the document storage means 2, the pairs of words and word reference probabilities (hereinafter, weighted word groups) stored in association with each sentence (step S412).
[0201] At this point, so that it can narrow the reading down to weighted word groups that are similar to some degree, the CPU 11 judges to which group the weighted word group associated with the accepted words, obtained by the processing up to step S411, belongs, in the same way as for the weighted word groups stored in the database. The CPU 11 then reads from the database the weighted word groups belonging to the same group as the weighted word group associated with the accepted words. This avoids comparisons with weighted word groups that are not similar at all, and narrows the extraction down to weighted word groups that are similar to some degree.
[0202] Next, from the weighted word groups read in step S412, the CPU 11 extracts those that contain the same words as the weighted word group of the accepted words (step S413). For each word shared with an extracted group, the CPU 11 calculates the difference between the reference probabilities (step S414). The CPU 11 assigns similarity scores to the extracted weighted word groups, ranking them higher the more words they share and the smaller the differences between the reference probabilities of the shared words (step S415), and reads from the document data of the document set the sentences with which the extracted weighted word groups are associated (step S416). At this point, the CPU 11 may read only the sentences corresponding to weighted word groups whose similarity is at least a predetermined value. The CPU 11 sorts the extracted sentences by similarity (step S417).
[0203] By the processing from step S412 to step S417 described above, sentences can be extracted whose associated weighted word groups have distributions of word weight values similar to the distribution of the weight values of the words in the weighted word group obtained for the accepted words.
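A minimal sketch of the matching in steps S413 to S415, assuming a simple scoring rule: more shared words raise the score, and larger gaps between the shared words' reference probabilities lower it. The specification does not fix an exact scoring formula here, so this combination is illustrative only.

```python
# Illustrative sketch of steps S413-S415: score a stored sentence's weighted
# word group against the query's group.

def similarity(query: dict[str, float], stored: dict[str, float]) -> float:
    shared = query.keys() & stored.keys()
    if not shared:
        return 0.0            # no common word: the group is skipped (step S413)
    diff = sum(abs(query[w] - stored[w]) for w in shared)   # step S414
    return len(shared) / (1.0 + diff)                       # step S415

query = {"Okunchi": 0.15, "Kyushu": 0.24}
sentence = {"Okunchi": 0.12, "Kyushu": 0.20, "Nagasaki": 0.08}
print(similarity(query, sentence))   # about 1.87: two shared words, small gaps
```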
[0204] Next, the CPU 11 transmits text data representing each extracted sentence to the reception device 4 via the communication means 15 as the text data of the search result (step S418).
[0205] The CPU 41 of the reception device 4 receives the text data of the search result via the communication means 48 (step S419), displays the received text data on a monitor or the like via the display means 46 (step S420), and ends the processing.
[0206] Each time it detects the input of words from the user, the CPU 41 of the reception device 4 transmits the text data or speech data, separated into individual sentences, to the sentence unit search device 1. Each time it receives text data, or speech data together with the information transmitted with it, from the reception device 4, the CPU 11 of the sentence unit search device 1 calculates the words and the reference probability of each word, and creates, as a search request for the words accepted from the user, information representing the semantic cohesion that reflects the flow from the preceding words, that is, a weighted word group. The CPU 11 of the sentence unit search device 1 extracts sentence units from the stored document data on the basis of the search request (weighted word group) created for the accepted words, and transmits text data as the search result.
[0207] The CPU 41 of the reception device 4 in Embodiment 1 displays the text data of the search result on a monitor or the like each time it receives such data. Consequently, every time words are input by the user, the reception device 4 displays as search results text data whose semantic cohesion is similar to that of those words.
[0208] The reception device 4 need not necessarily be configured to transmit text data and to receive and display a search result every single time words are input by the user. For example, it may be configured to transmit to the sentence unit search device 1 text data or speech data corresponding to a plurality of words input during a predetermined period, and to receive and display the search results corresponding to those words.
[0209] The details of the processing by the CPU 11 of the sentence unit search device 1 shown in the flowcharts of FIG. 15, FIG. 16, and FIG. 17 are described below with a concrete example.
[0210] FIG. 18 is an explanatory diagram showing an example of the feature patterns that the CPU 11 of the sentence unit search device 1 in Embodiment 1 identified for text data received from the reception device 4. The sentence units s_{i-2}, s_{i-1}, and s_i in FIG. 18 are the sentences represented by the respective pieces of received text data.
[0211] The feature pattern of the sample (s_i, "Okunchi"), for the word "Okunchi" contained in sentence unit s_i of FIG. 18 and in the preceding sentence units, is identified as follows. Among the current sentence s_i and the preceding sentences, the sentence in which the word "Okunchi" most recently appeared or was referenced is s_{i-2}, and the distance feature quantity (dist) between s_i and s_{i-2} is dist = 3. The case particle governing "Okunchi" in s_{i-2}, the sentence where it most recently appeared or was referenced, is "tte", so gram = tte. Further, since the word "Okunchi" appeared or was referenced in the sentence s_{i-2} preceding the sentence s_i, chain = 1. The feature pattern is therefore identified as f(s_i, "Okunchi") = (dist = 3, gram = tte, chain = 1). In the case of English, gram is identified by a preposition.
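For concreteness, the feature pattern of paragraph [0211] can be pictured as a small record type; the sketch below is an editorial illustration, not part of the disclosure, holding the three feature quantities for the sample f(s_i, "Okunchi").

```python
from dataclasses import dataclass

# Editorial illustration of the feature pattern as a record: dist is how far
# back the word last appeared or was referenced, gram the case particle (a
# preposition in English) governing it there, chain whether it had already
# appeared in the preceding context.

@dataclass(frozen=True)
class FeaturePattern:
    dist: int
    gram: str
    chain: int

f_okunchi = FeaturePattern(dist=3, gram="tte", chain=1)   # f(s_i, "Okunchi")
print(f_okunchi)
```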
[0212] In the sentence unit search device 1, a regression equation has been derived in advance for spoken language as well, by performing regression analysis on the document data stored in the document storage means 2, such that a reference probability can be calculated by substituting in the feature quantities once a feature pattern has been identified. The CPU 11 of the sentence unit search device 1 can therefore calculate a reference probability for "Okunchi" in sentence s_i from the feature quantities dist, gram, and chain of the identified feature pattern. Furthermore, the CPU 11 of the sentence unit search device 1 calculates reference probabilities for sentence s_i including the words that appeared or were referenced in the past, obtaining the words and their reference probabilities. On the basis of the obtained words and reference probabilities, the CPU 11 of the sentence unit search device 1 directly extracts, from the sentence units whose salience attributes are stored in advance in the document storage means 2, those sentence units in which the reference probabilities of the same words are at least a predetermined value. The CPU 11 of the sentence unit search device 1 transmits text data representing the extracted sentences to the reception device 4 via the communication means 15.
[0213] Through such processing by the CPU 11 of the sentence unit search device 1, the semantic cohesion of the words represented by the received text data can be expressed, for each set of words, by words and their reference probabilities (weight values). Moreover, since a word group representing semantic cohesion together with the words' reference probabilities (a weighted word group) is also stored for each sentence of the document data held in advance in the document storage means 2, sentences whose semantic cohesion is similar to that of the words accepted from the user can be searched for directly, according to whether the reference probabilities of the extracted words are similar.
[0214] (Embodiment 2)
In Embodiment 2, for each sentence of the document data stored in the document storage means 2 in the pre-processing stage, the pair of the extracted words and the reference probabilities calculated for them (the weighted word group) is treated as a salience vector. Likewise, the pair of words and per-word reference probabilities calculated for the accepted words (the weighted word group) is also treated as a salience vector. In the search processing stage of Embodiment 1, whether the distribution of the weight values of the words in the weighted word group of the accepted words and the distribution of the weight values in the weighted word group associated in advance with each sentence were similar was judged by whether the same words were stored and the differences between the same words' values were small. In Embodiment 2, by contrast, each weighted word group is represented as a salience vector, and whether the similarity condition holds is judged by the shortness of the distance between the salience vectors.
[0215] For the search system using the sentence unit search device 1 according to the present invention in Embodiment 2, "1. Hardware configuration and overview" and "2. Acquisition of document data and natural language analysis" are the same as in Embodiment 1, and their description is therefore omitted. "3. Quantification of the semantic cohesion of each sentence of the document data" and "4. Search processing" are described below, using the same reference numerals as in Embodiment 1; for these, too, detailed description of the points they share with Embodiment 1 is omitted.
[0216] 3. Quantification of the semantic cohesion of each sentence of the document data
3-1. Definition of the semantic cohesion of each sentence
In Embodiment 2, as in Embodiment 1, the information quantitatively representing the semantic cohesion of each sentence is expressed by the group of words the user is attending to when using the sentence (speaking, writing, listening to, or reading it), and by values quantitatively indicating the degree to which the user attends to each word, that is, its salience (word weight values). Also as in Embodiment 1, the reference probability, which indicates the probability that a word will appear or be referenced in subsequent sentences, is used as the weight value quantitatively expressing salience.
[0217] 3-2. Regression model learning
In Embodiment 2 as well, reference probabilities are calculated using a regression equation containing regression coefficients obtained by regression analysis of samples of the document data stored in the document storage means 2, as in "3-2. Regression model learning" of Embodiment 1.
[0218] 3-3. Quantification of salience for each sentence unit
In Embodiment 2 as well, the CPU 11 of the sentence unit search device 1 can calculate a reference probability for each extracted word by identifying the feature quantities dist, gram, and chain and using the regression equation containing the regression coefficients obtained by the regression analysis. This yields a weighted word group in which each word's reference probability is assigned as its weight value. In Embodiment 2, the weighted word group representing the semantic cohesion of each sentence is treated as a salience vector that takes each word as one dimension and holds the reference probability calculated for each word as the element of the dimension component corresponding to that word. That is, the semantic cohesion of a sentence in the document data stored in the document storage means 2 can be represented as a vector in the 31245-dimensional multidimensional space of the words extracted from that document data and stored in the list shown in FIG. 6.
[0219] Accordingly, with respect to the 31245-dimensional basis space formed by the word group (ai, aida, aimai, ..., Z, Z-kun), the salience vector v(s_3) of the sentence s_3 shown in FIG. 11 has, as the element corresponding to the 9714th dimension, "Kyushu", the magnitude of its reference probability (weight value), 0.238, and, as the element corresponding to the 9716th dimension, "northern Kyushu region", the magnitude of its reference probability, 0.1159. It can therefore be expressed and handled as the 31245-dimensional vector (0, 0, ..., 0.238, 0, 0.1159, ..., 0).
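Since only a handful of the 31245 dimensions are nonzero for any one sentence, a sparse mapping from dimension number to reference probability is a natural realization of such a salience vector. The sketch below is an assumption about the data layout, mirroring the dimension numbers quoted for sentence s_3.

```python
# Assumed sparse layout for a salience vector: only the nonzero pairs of
# (dimension number -> reference probability) are kept, which is exactly
# what the salience attribute records.

salience_s3 = {
    9714: 0.238,    # dimension for "Kyushu"
    9716: 0.1159,   # dimension for "northern Kyushu region"
}

def component(vector: dict[int, float], dim: int) -> float:
    """Every dimension not listed is implicitly 0."""
    return vector.get(dim, 0.0)

print(component(salience_s3, 9714), component(salience_s3, 1))   # 0.238 0.0
```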
[0220] The document data that the CPU 11 of the sentence unit search device 1 in Embodiment 2 stores in the document storage means 2 with the calculated reference probabilities attached is the same as the document data shown in the explanatory diagram of FIG. 11 of Embodiment 1. That is, the document data stored in the document storage means 2 holds the dimension numbers and the reference probability values that are the elements of the dimension components. The processing procedure by which the CPU 11 of the sentence unit search device 1 in Embodiment 2 calculates word reference probabilities for each sentence of the tagged document data stored in the document storage means 2 and stores them in the database in association with each sentence is the same as in Embodiment 1, and its description is therefore omitted.
[0221] 4. Search processing
Next, the search processing in Embodiment 2 is described. As for "4-1. Acceptance of words input by the user", the processing performed by the CPU 41 of the reception device 4 is the same as in Embodiment 1.
[0222] 4-2. Quantification of the semantic cohesion of the accepted words
The processing by which the CPU 11 of the sentence unit search device 1, upon receiving text data representing words accepted by the reception device 4, searches the sentences of the documents stored in the document storage means 2 is described. For the text data representing the accepted words as well, the CPU 11 of the sentence unit search device 1 expresses the contextual semantic cohesion of the accepted words as a salience vector indicating a direction in the multidimensional word space.
[0223] As in the processing of Embodiment 1, the CPU 11 of the sentence unit search device 1 identifies, for the text data received from the reception device 4, the feature patterns expressed by the feature quantities dist, gram, and chain for the 31245 dimensions of words stored in the list. For words that have not appeared in the text data received so far as a series, the elements of the corresponding dimension components are set to 0 and the feature pattern identification is omitted.
[0224] From the feature quantities dist, gram, and chain expressing a feature pattern, the reference probabilities serving as the elements of the dimension components can each be calculated on the basis of the regression equation. Therefore, each time it receives text data, the CPU 11 of the sentence unit search device 1 can calculate a salience vector representing the contextual semantic cohesion, up to that point, of the words represented by the received text data.
[0225] The CPU 11 of the sentence unit search device 1 directly calculates, by vector computation, the distance between the salience vector calculated for the accepted words and the salience vectors of the sentences stored in the document storage means 2 with salience attributes attached in advance, and extracts the sentences at short distances. Sentences whose semantic cohesion points in a similar direction can thus be searched for within the 31245-dimensional multidimensional space in which each word of FIG. 6 constitutes one dimension. The CPU 11 of the sentence unit search device 1 transmits text data representing the extracted sentences to the reception device 4 via the communication means 15. When a computer capable of handling vector operations is used, the semantic cohesion of each sentence can be represented as a salience vector and operated on directly.
[0226] The processing procedure by which the CPU 11 of the sentence unit search device 1 receives text data representing the words of a search request from the reception device 4 and searches the document data stored in the document storage means 2 using salience vectors on the basis of the received text data is described. FIG. 19 is a flowchart showing the processing procedure of the search processing of the sentence unit search device 1 and the reception device 4 in Embodiment 2. In the processing procedure shown in the flowchart of FIG. 19, steps identical to those of the search processing shown in the flowcharts of FIG. 15, FIG. 16, and FIG. 17 in Embodiment 1 are given the same reference numerals, and their detailed description is omitted.
[0227] Within the processing procedure shown in the flowchart of FIG. 19, the processing of steps S501 to S506, enclosed by the dash-dotted line, differs from the processing procedure shown in the flowcharts of FIG. 15, FIG. 16, and FIG. 17 in Embodiment 1. The processing from step S501 to step S506, executed by the CPU 11 of the sentence unit search device 1 in Embodiment 2 in place of the processing from step S412 to step S416 of Embodiment 1, is described below.
[0228] The CPU 11 of the sentence unit search device 1 narrows the words, for each of which a reference probability has been calculated and stored in the temporary storage area 14, down to those whose calculated reference probability is at least a predetermined value (step S411), and calculates the salience vector of the accepted words on the basis of the narrowed-down words and their calculated reference probabilities (step S501).
[0229] By the processing up to step S501, a salience vector quantitatively representing, for the accepted words, the semantic cohesion in the flow continuing from the previously accepted words has been generated as a search request. The processing below is one example of processing that compares the salience vector obtained for the accepted words with the salience vectors stored in advance for each sentence and judges whether the distributions of the weight values of the words represented by the respective salience vectors are similar.
[0230] The CPU 11 reads the weighted word groups, that is, the salience vectors, stored in the database (step S502). At this point, it judges to which group the salience vector associated with the accepted words, obtained by the processing up to step S411, belongs, in the same way as for the salience vectors stored in the database. The CPU 11 reads from the database the salience vectors belonging to the same group as the salience vector associated with the accepted words. This narrows the extraction, to some degree, down to salience vectors whose distributions of word weight values are similar.
[0231] The CPU 11 calculates the distance between the salience vector associated with the accepted words and each salience vector that was read (step S503). The CPU 11 narrows the read salience vectors down to those whose calculated distance is less than a predetermined value (step S504), and reads the sentences stored in association with the narrowed-down salience vectors (step S505). The CPU 11 assigns similarity scores to the read sentences, ranking them higher the shorter the calculated distance (step S506).
[0232] Through the processing from step S501 to step S506 by the CPU 11 of the sentence unit search device 1 in Embodiment 2, sentences whose contextual meaning is similar to that of the accepted words are extracted.
[0233] The subsequent processing of the extracted sentences from step S417 onward is the same as in Embodiment 1.
[0234] Within the processing procedure described above, the processing of step S503, in which the CPU 11 calculates the distance between the salience vector associated with the accepted words and each read salience vector, concretely performs the calculation as follows. When the salience vector associated with the accepted words u_i is denoted v(u_i) and a read salience vector is denoted v(s_j), the CPU 11 calculates the cosine distance as shown in Equation (5) below.
[0235] [Equation 5]

$$\mathrm{sim}\bigl(v(u_i),\, v(s_j)\bigr) \;=\; \frac{v(u_i)\cdot v(s_j)}{\lVert v(u_i)\rVert\,\lVert v(s_j)\rVert}$$
[0236] Note that when the distance is calculated as shown in Equation (5), the closer the salience vector v(u_i) of the words and the read salience vector v(s_j) are, the larger the calculated cosine distance value becomes. Accordingly, in step S506 the CPU 11 assigns similarity scores in descending order of the calculated cosine distance.
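A direct transcription of Equation (5) on the sparse representation might look as follows; the function and variable names are illustrative, not part of the disclosure.

```python
import math

# Sketch of Equation (5): the cosine measure between the query's salience
# vector v(u_i) and a stored sentence's salience vector v(s_j), computed
# directly on the sparse (dimension -> weight) representation.

def cosine(v_u: dict[int, float], v_s: dict[int, float]) -> float:
    dot = sum(w * v_s.get(d, 0.0) for d, w in v_u.items())
    norm_u = math.sqrt(sum(w * w for w in v_u.values()))
    norm_s = math.sqrt(sum(w * w for w in v_s.values()))
    if norm_u == 0.0 or norm_s == 0.0:
        return 0.0
    return dot / (norm_u * norm_s)     # larger value = vectors point closer

v_query = {9714: 0.238, 9716: 0.1159}
v_sentence = {9714: 0.20, 9716: 0.10, 120: 0.05}
print(cosine(v_query, v_sentence))     # about 0.98: similar salience
```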
[0237] Through such processing by the CPU 11 of the sentence unit search device 1 and the CPU 41 of the reception device 4, the semantic cohesion of the accepted words can be represented, for each set of words, by a salience vector whose elements are the reference probabilities of the individual words. Moreover, since a salience vector whose elements are the reference probabilities of the words representing its semantic cohesion is also stored for each sentence of the document data held in advance in the document storage means 2, sentences with similar semantic cohesion can be searched for directly by the distance between salience vectors, which express directions in the multidimensional word space.
[0238] (Embodiment 3)
In Embodiments 1 and 2, within the pre-processing that performs "3. Quantification of the semantic cohesion of each sentence unit of the document data", weighted word groups, as pairs of words and word reference probabilities or as salience vectors, were stored in association with each sentence unit. In the subsequent "4. Search processing" as well, within "4-2. Quantification of the semantic cohesion of the accepted words", a weighted word group, as a pair of words and word reference probabilities or as a salience vector, was obtained and associated with the accepted words. In Embodiment 3, by contrast, for the weighted word group (the pair of words and word reference probabilities, or the salience vector) associated with each sentence unit or set of words, processing is executed that recalculates the weight value representing each word's salience so as to take into account associations from other words deeply related to that word.
[0239] Concretely, association means that even when a word in the weighted word group associated with a sentence unit does not appear in that sentence unit or in the preceding sentence units, if a word deeply related to it has high salience, that word too should be receiving attention in that sentence unit. Accordingly, words that tend to receive attention at the same time as a given word is receiving attention are taken as its related words, and the influence of the salience of deeply related words is reflected in the weight value representing each word's salience.
[0240] FIG. 20 is an explanatory diagram outlining the influence of the salience of words deeply related to a given word, as it concerns the search method of the present invention in Embodiment 3. The explanatory diagram of FIG. 20 represents an example of a conversation between one or more users. The conversation is a set of utterances u_1, u_2, u_3, u_4, made in the order u_1, u_2, u_3, u_4.
[0241] Here, "Osaka" does not appear in any of the utterances u_1, u_2, u_3, u_4. "Osaka" did appear in an utterance preceding u_1, so even if the salience of "Osaka" in each of the utterances u_1, u_2, u_3, u_4 was not zero but of some height, "Osaka" has not appeared since; consequently, if the reference probability expressing the salience of "Osaka" is calculated quantitatively at the time of utterance u_4, its value may have fallen.
[0242] However, even though the word "Osaka" has not appeared in the sentence units or words up to that point, the words "Amerika-mura" and "Minami" appear in utterances u_1 and u_3. If reference probabilities are calculated for "Amerika-mura" and "Minami" at the time of utterance u_4, their values should therefore be high. Since both Amerika-mura and Minami are well-known entertainment districts of Osaka, the appearance of "Amerika-mura" or "Minami" should in principle raise the salience of the deeply related word "Osaka", even though the word "Osaka" itself neither appears nor is referenced in utterance u_4. In the example of FIG. 20, therefore, the reference probability expressing the salience of "Osaka" in utterance u_4 ought to have a high value.
[0243] In Embodiment 3, therefore, the weight value representing the salience of each word associated with a sentence unit or with a set of words is recalculated in consideration of the salience of its related words.
[0244] To recalculate the reference probabilities into weight values that take the salience of related words into account, the sentence unit search device 1 must first obtain information representing which words are deeply related to which. The influence of the degree of relatedness, which expresses the depth of the relation, is then reflected in the reference probability of each word calculated for each sentence unit. Concretely, in the example above, the degree of relatedness of "Amerika-mura" to "Osaka" is first calculated quantitatively. The effect of that relatedness to "Osaka" is then applied to the already calculated reference probability of "Amerika-mura", and a weight value representing the salience of "Osaka" in that sentence unit is recalculated and stored.
[0245] In Embodiment 3, therefore, the sentence unit search device 1 first creates, for each word, a weighted related word group in which the degree of relatedness of every word to that one word is assigned as a weight value. Concretely, the sentence unit search device 1 creates the weighted related word group of each word using the weighted word groups, that is, the pairs of words and word reference probabilities or the salience vectors, stored in association with each sentence unit by the processing of "3-3. Quantification of salience for each sentence unit" in Embodiment 1 or 2. The sentence unit search device 1 creates and stores a weighted related word group for each word extracted from the entire document set.
[0246] Next, the sentence unit search device 1 reflects, into the reference probability of each word in the weighted word group stored in association with each sentence unit (the pair of words and word reference probabilities, or the salience vector), the influence of the reference probabilities of the words deeply related to it, using the degrees of relatedness, and recalculates and stores each word's weight value.
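The specification states that the degrees of relatedness are used to fold related words' reference probabilities back into each word's weight, but at this point it does not fix the combining formula. The additive rule in the sketch below is therefore purely an assumed placeholder; it only shows the direction of the computation, with "Osaka" gaining weight from salient related words even though it never appeared itself.

```python
# Assumed placeholder for paragraph [0246]: fold the salience of related
# words back into each word's weight. The additive rule is NOT given by the
# specification; it is for illustration only.

def associate(weights: dict[str, float],
              relatedness: dict[str, dict[str, float]]) -> dict[str, float]:
    adjusted = dict(weights)
    for word, related in relatedness.items():
        for other, degree in related.items():
            # a salient related word raises this word's weight
            adjusted[word] = adjusted.get(word, 0.0) + degree * weights.get(other, 0.0)
    return adjusted

weights = {"Amerika-mura": 0.6, "Minami": 0.3}                  # from the utterances
relatedness = {"Osaka": {"Amerika-mura": 0.5, "Minami": 0.4}}   # hypothetical degrees
print(associate(weights, relatedness))   # "Osaka" gains 0.42 despite never appearing
```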
[0247] Furthermore, in the search processing the sentence unit search device 1 likewise uses the degrees of relatedness to recalculate the weight value of each word in the weighted word group associated with each set of words (the pair of words and word reference probabilities, or the salience vector). The sentence unit search device 1 performs the search processing on the basis of the words corresponding to the accepted words and the weight values recalculated for each word.
[0248] In the following, the processing by which the CPU 11 of the sentence unit search device 1 creates the weighted related word group for each word is described in an added section "3-4. Creation of related word groups". The processing that uses the created related word groups to recalculate the reference probabilities calculated in "3-3. Quantification of salience for each sentence unit" into weight values that take relatedness into account is described in an added section "3-5. Quantification of semantic cohesion with association taken into account". The processing that recalculates the reference probabilities calculated in "4-2. Quantification of the semantic cohesion of the accepted words" into weight values that take relatedness into account and then executes the search is described in a section "4-2'. Quantification of the semantic cohesion of the accepted words with association taken into account".
[0249] For the search system using the sentence unit search device 1 according to the present invention in Embodiment 3, "1. Hardware configuration and overview" and "2. Acquisition of document data and natural language analysis" are the same as in Embodiment 1, and their description is therefore omitted. "3. Quantification of the semantic cohesion of each sentence of the document data" and "4. Search processing" are described below, using the same reference numerals as in Embodiment 1; for these, too, detailed description of the points they share with Embodiment 1 is omitted.

[0250] 3-4. Creation of related word groups
A related word group is created for each single word, for all the words extracted in the explanatory diagram shown in FIG. 6, by the sentence unit search device 1 performing the following processing.
[0251] First, from the weighted word groups stored in association with every sentence unit in "3-3. Quantification of salience for each sentence unit", the sentence unit search device 1 extracts the weighted word groups in which the reference probability of the one word is at least a predetermined value. This is because, as described above, related words are taken to be words that tend to receive attention at the same time as the one word is receiving attention, so sentence units in which the one word is not receiving attention are to be removed.
[0252] Next, the sentence unit search device 1 integrates the weighted word groups, extracted by the above processing, in which the reference probability of the one word is at least the predetermined value. Concretely, the reference probability of each word in each weighted word group is weighted by the reference probability of the one word contained in that weighted word group, and the reference probabilities of each word are then averaged. The weighting by the reference probability of the one word is applied so that the per-word reference probabilities of the weighted word groups in which the one word's reference probability is higher are used more strongly.
[0253] Then, so that the weighted related word groups of all words can be handled uniformly, the weight value of each word in the weighted related word group is normalized.
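Summarizing paragraphs [0251] to [0253], one plausible realization of the related word group construction is sketched below: filter, weight, sum, and normalize. The threshold and the L2 normalization follow the text; the sample group values are chosen so that the pre-normalization totals reproduce the FIG. 23 worked example quoted later in paragraphs [0271] and [0274].

```python
import math

# Plausible realization of "3-4. Creation of related word groups": keep the
# stored weighted word groups in which the target word's reference
# probability reaches the threshold, weight each group by that probability,
# sum per word, and normalize.

def related_word_group(target: str,
                       groups: list[dict[str, float]],
                       threshold: float = 0.2) -> dict[str, float]:
    total: dict[str, float] = {}
    for group in groups:
        t = group.get(target, 0.0)
        if t < threshold:
            continue                   # the target word is not salient here
        for word, prob in group.items():
            total[word] = total.get(word, 0.0) + t * prob    # weight and sum
    norm = math.sqrt(sum(v * v for v in total.values()))     # L2 normalization
    return {w: v / norm for w, v in total.items()} if norm else total

groups = [
    {"Amerika-mura": 0.6, "Osaka": 0.2},
    {"Amerika-mura": 0.3, "Osaka": 0.4, "aki": 0.1},
    {"Amerika-mura": 0.2, "Osaka": 0.2},
]
# Pre-normalization totals: Amerika-mura 0.49, Osaka 0.28, aki 0.03
print(related_word_group("Amerika-mura", groups))
```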
[0254] The processing by which the CPU 11 of the sentence unit search device 1, which implements the sentence unit search method according to the present invention, creates related word groups is described below. FIG. 21 and FIG. 22 are flowcharts showing the processing procedure by which the CPU 11 of the sentence unit search device 1 in Embodiment 3 creates related word groups. The processing shown in the flowcharts of FIG. 21 and FIG. 22 corresponds to: processing that extracts, for one word, the word groups in which its weight value is at least a predetermined value; processing that integrates the weight values of each word of the extracted word groups and creates a related word group in which the result is assigned to each word as its degree of relatedness; processing that stores the group in association with the one word; and processing that executes these steps for every word.
[0255] The CPU 11 of the sentence unit search device 1 selects one word from the list stored in the storage means 13 (step S601). The CPU 11 acquires tagged document data from the document storage means 2 via the document set connection means 16 (step S602). The CPU 11 identifies the <su> tags added to the acquired document data by character string analysis and reads out a sentence unit (step S603). Next, the CPU 11 reads the salience attribute stored within <su> (step S604), and judges whether, in the pairs of words and word reference probabilities (the weighted word group) stored in the salience attribute, the reference probability of the one word selected in step S601 is at least a predetermined value (step S605).
[0256] When the CPU 11 judges that the reference probability is below the predetermined value (or that the selected one word is not associated with the unit) (S605: NO), the CPU 11 returns the processing to step S603, reads the following sentence unit (S603), and performs the processing of steps S604 and S605.
[0257] When the CPU 11 judges that the reference probability is at least the predetermined value (S605: YES), the CPU 11 stores the weighted word group read from the salience attribute in step S604 in the temporary storage area (step S606).
[0258] The CPU 11 judges whether the processing from step S604 to step S606 has been executed for all sentence units of the document data acquired in step S602 (step S607). When the CPU 11 judges that the processing has not been executed for all sentence units (S607: NO), the CPU 11 returns the processing to step S603, reads the following sentence unit (S603), and executes the processing from step S604 to step S606.
[0259] When the CPU 11 judges that the processing has been executed for all sentence units (S607: YES), the CPU 11 judges whether the weighted word groups in which the reference probability of the selected one word is at least the predetermined value have been extracted from all the document data (step S608). When the CPU 11 judges that they have not yet been extracted from all the document data (S608: NO), the CPU 11 returns the processing to step S602, acquires the next document data (S602), and executes the processing from step S603 to step S607.
[0260] When the CPU 11 judges that the weighted word groups in which the reference probability of the selected one word is at least the predetermined value have been extracted from all the document data (S608: YES), the CPU 11 creates the integrated group from the set of weighted word groups extracted by the processing of step S606 and stored in the temporary storage area 14, by calculating for each word the sum of its weight values, each weighted by the reference probability of the one word in the respective group (step S609).
[0261] The CPU 11 normalizes the sum, created in step S609, of the weighted word groups in which the reference probability of the one word is at least the predetermined value, that is, the weight value of each word in the summed weighted word group (step S610).
[0262] The CPU 11 stores the weighted word group normalized in step S610, in which the reference probability of the one word is at least the predetermined value, as a related word group whose weight values serve as degrees of relatedness, in association with the one word selected in step S601, either in the storage means 13 or, via the document set connection means 16, in the document storage means 2 (step S611).
[0263] Next, the CPU 11 of the sentence unit search device 1 judges whether related word groups have been created and stored for all the words in the list stored in the storage means 13 (step S612). When the CPU 11 judges that related word groups have not been created and stored for all the words (S612: NO), the CPU 11 returns the processing to step S601, selects the next word (S601), and executes the processing from step S602 to step S611 for the selected word.
[0264] When the CPU 11 judges that related word groups have been created and stored for all the words (S612: YES), the CPU 11 ends the processing.
[0265] In step S605, rather than simply judging whether the reference probability is at least the predetermined value, the CPU 11 of the sentence unit search device 1 may perform normalization processing such as the following before making the comparison with the predetermined value. For example, the CPU 11 of the sentence unit search device 1 may normalize by dividing each reference probability by the square root of the sum of the squares of all the reference probabilities, so that the sum of the squares of the reference probabilities of the words associated with a sentence unit becomes 1.
[0266] The normalization in step S610 is likewise performed so that the sum of the squares of the weight values of the words becomes 1. For example, the CPU 11 of the sentence unit search device 1 performs the normalization by dividing each weight value by the square root of the sum of the squares of all the weight values.
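The normalization of paragraphs [0265] and [0266] amounts to dividing every value by the square root of the sum of squared values, so that the squares afterwards sum to 1; a minimal sketch:

```python
import math

# Minimal sketch of the normalization of paragraphs [0265]-[0266]: divide
# every weight by the square root of the sum of squared weights, so that
# the squares then sum to 1.

def normalize(weights: dict[str, float]) -> dict[str, float]:
    root = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / root for w, v in weights.items()} if root else weights

w = normalize({"Amerika-mura": 0.49, "Osaka": 0.28, "aki": 0.03})
print(round(sum(v * v for v in w.values()), 6))    # 1.0
```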
[0267] Next, a concrete example is given of the related word group created when the CPU 11 of the sentence unit search device 1 in Embodiment 3 performs the processing shown in the flowcharts of FIG. 21 and FIG. 22 for one word.
[0268] FIG. 23 is an explanatory diagram showing examples of the weighted word groups at each stage of processing when a related word group is created by the CPU 11 of the sentence unit search device 1 in Embodiment 3. The example shown in the explanatory diagram of FIG. 23 is the case where the CPU 11 of the sentence unit search device 1 has extracted the weighted word groups in which the reference probability of the one word "Amerika-mura" is at least the predetermined value (0.2). FIG. 23(a) shows the weighted word groups GW_1, GW_2, GW_3 extracted by the processing of the CPU 11 in step S605 shown in the flowcharts of FIG. 21 and FIG. 22 and stored in the temporary storage area 14. FIG. 23(b) likewise shows the weighted word groups GW_1', GW_2', GW_3' weighted by the reference probability of the one word through the processing of the CPU 11 in step S607. FIG. 23(c) likewise shows the weighted word group GW'' weighted and summed through the processing of the CPU 11 in step S609.
[0269] As shown in FIG. 23(a), the weighted word groups GW_1, GW_2, GW_3 in which the weight value (reference probability) of the one word "Amerika-mura" is at least the predetermined value 0.2 have been extracted.
[0270] In the weighted word groups GW_1', GW_2', GW_3' shown in FIG. 23(b), each word's weight value has been multiplied by the weight value (reference probability) of the one word "Amerika-mura" in the respective weighted word group. Relative to the word groups GW_1, GW_2, GW_3 shown in FIG. 23(a), the weight value of each word in the word groups GW_1', GW_2', GW_3' shown in FIG. 23(b) has been multiplied by the weight value (reference probability) of the one word "Amerika-mura" as follows. For example, since the weight value (reference probability) of Amerika-mura in the weighted word group GW_1 is 0.6, the weight value of each word in GW_1' is weighted by that reference probability, as follows.
[0271] Word group GW1' = (autumn: 0 (0.6 × 0), Amerikamura: 0.36 (0.6 × 0.6), ..., Ursa Major: 0 (0.6 × 0), Osaka: 0.12 (0.6 × 0.2), Oshika: 0 (0.6 × 0), ...)
[0272] In other words, the higher the weight value of the single word "Amerikamura" in a group, the more strongly the weight values of the other words in that group are reflected.
[0273] In the weighted word group GW'' shown in Fig. 23(c), the weight value of each word is obtained by summing, word by word, the weight values that were weighted by the weight value (reference probability) of the single word "Amerikamura" as shown in Fig. 23(b). The weight values of the word group GW'' of Fig. 23(c) are the sums of the word groups GW1', GW2' and GW3' of Fig. 23(b), as follows.
[0274] Word group GW'' = (autumn: 0.03 (= 0 + 0.03 + 0), Amerikamura: 0.49 (= 0.36 + 0.09 + 0.04), ..., Ursa Major: 0 (= 0 + 0 + 0), Osaka: 0.28 (= 0.12 + 0.12 + 0.04), Oshika: 0 (= 0 + 0 + 0), ...)
[0275] The weight values of the weighted word group GW'', integrated by this weighting and summing, are then normalized by the processing of the CPU 11 of the sentence unit search device 1.
[0276] Any normalization method may be used. For example, the CPU 11 of the sentence unit search device 1 may square the weight value of each word, calculate the square root of the sum of the squared values, and divide each word's weight value by that square root, thereby normalizing the weight values of the weighted word group GW''.
[0277] Alternatively, when the integrated weighted word group GW'' is expressed as a relevance vector, i.e. a multidimensional vector in which each word is one dimension and each word's weight value is the component in the corresponding dimension direction, the multidimensional vector may be normalized by dividing each weight value (component) by the norm of the vector. The norm here is not necessarily the Euclidean norm.
[0278] The weighted word group obtained by summing and normalizing in this way is created by the CPU 11 of the sentence unit search device 1 as the related word group of "Amerikamura". The example shown below is one such related word group of the word "Amerikamura". The words are listed in descending order of weight value.
[0279] Related word group ("Amerikamura") = (Amerikamura: 0.647, America: 0.369, Osaka: 0.258, village: 0.159, security camera: 0.139, camera: 0.139, checkout: 0.129, out: 0.129, inside: 0.128, woman: 0.120, man: 0.102, center: 0.098, crime: 0.092, person: 0.087, takoyaki: 0.082, Shinsaibashi: 0.075, Minami: 0.074, police: 0.073, time: 0.071, park: 0.065, Showa: 0.064, this time: 0.063, count: 0.061, Namba: 0.060, Mitsu: 0.060, Land Rover (registered trademark): 0.059, Rover (registered trademark): 0.059, name: 0.059, plan: 0.057, Dotonbori: 0.055, Tachikawa: 0.055, number: 0.054, Nishitetsu: 0.053, sa': 0.052, Ina: 0.050, original sticker: 0.049, sticker: 0.049, Inn Shinsaibashi: 0.049, Midosuji Line: 0.049, ...)
[0280] The above is a related word group of "Amerikamura" actually created using a document set (the GDA-tagged Mainichi Shimbun corpus; see http://www.gsk.or.jp/catalog.html).
[0281] As the above concrete example of the related word group of "Amerikamura" shows, when "Amerikamura" is in focus, the fact that "Osaka" is a related word attracting more attention than other words can be expressed quantitatively by its weight value. The weight value of each word in the related word group can therefore be said to represent the degree of relevance to the single word. In the above concrete example, the degree of relevance of "Amerikamura" to "Osaka" is 0.258.
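As an illustrative summary of steps S605, S607, S609 and S610, the related word group of one word might be assembled as in the following Python sketch, reusing the l2_normalize helper above. The input values mirror the toy numbers of Fig. 23; all names are assumptions for the example.

```python
def related_word_group(anchor: str,
                       sentence_groups: list[dict[str, float]],
                       threshold: float = 0.2) -> dict[str, float]:
    """Build the related word group of `anchor` from per-sentence groups."""
    total: dict[str, float] = {}
    for group in sentence_groups:
        w = group.get(anchor, 0.0)
        if w < threshold:       # S605: keep groups where the anchor is salient
            continue
        for word, prob in group.items():
            # S607: weight by the anchor's probability; S609: sum per word.
            total[word] = total.get(word, 0.0) + w * prob
    return l2_normalize(total)  # S610: normalize the summed weights

# Toy groups modeled on Fig. 23 (anchor probabilities 0.6, 0.3 and 0.2).
gw1 = {"Amerikamura": 0.6, "Osaka": 0.2, "autumn": 0.0}
gw2 = {"Amerikamura": 0.3, "Osaka": 0.4, "autumn": 0.1}
gw3 = {"Amerikamura": 0.2, "Osaka": 0.2, "autumn": 0.0}
print(related_word_group("Amerikamura", [gw1, gw2, gw3]))
# Before normalization the sums are 0.49, 0.28 and 0.03, as in Fig. 23(c).
```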
[0282] Hereinafter, each weight value in the related word group created for a word w_j, i.e. the degree of relevance of the word w_j to a word w_k, is written b_{j,k}. The related word group of one word w_j is written bw_j = (w_1: b_{j,1}, w_2: b_{j,2}, ..., w_N: b_{j,N}). When a related word group is expressed as a relevance vector, it is written bw_j = (b_{j,1}, b_{j,2}, ..., b_{j,N}).
[0283] The CPU 11 of the sentence unit search device 1 repeats the above processing for all the words shown in the explanatory diagram of Fig. 6 to create a related word group for each word, and stores them in the document storage means 2 or in the storage means 13 of the sentence unit search device 1. By creating and storing, for every word appearing in the document set, a related word group in which degrees of relevance have been quantitatively calculated and assigned in this way, the influence of the relevance of related words can be reflected in the weighted word group that represents the semantic coherence of each sentence unit.
[0284] 3-5. Quantifying semantic coherence with association taken into account
Next, the degree of relevance of each word in the created related word groups is reflected in the weighted word group stored for each sentence unit, i.e. in the pairs of words and reference probabilities or in the salience vector. Specifically, the sentence unit search device 1 reads out the reference probability of each word that has already been calculated and stored, and recalculates and stores, as the weight value of a given word, the value obtained by multiplying each word's reference probability by that word's degree of relevance to the given word.
[0285] Fig. 24 is a flowchart showing the processing procedure by which the CPU 11 of the sentence unit search device 1 according to Embodiment 3 recalculates the weight value of each word of the weighted word group stored in association with each sentence unit. The processing shown in the flowchart of Fig. 24 corresponds to the processing of reassigning, using the degrees of relevance, the weight values of the weighted word group associated with each sentence unit.
[0286] The CPU 11 of the sentence unit search device 1 acquires tagged document data from the document storage means 2 via the document set connection means 16 (step S71). The CPU 11 identifies the <su> tags added to the acquired document data by character string analysis and reads out a sentence unit (step S72).
[0287] Next, the CPU 11 reads out the salience attribute stored in the <su> tag (step S73), and recalculates each reference probability of the pairs of words and reference probabilities (the weighted word group) stored in association by the salience attribute into a weight value that takes association into account, using the related word groups (step S74). The CPU 11 stores again, with the salience attribute attached, the weighted word group (salience vector) consisting of each word and the weight value recalculated for it in step S74 (step S75).
[0288] Next, the CPU 11 judges whether the sentence unit read in step S72 is the end of the document data (step S76). Whether the current sentence is the end of the acquired document data can be judged by checking whether another <su> tag follows the <su> </su> pair enclosing the current sentence; if none follows, the end has been reached. If the CPU 11 judges that it is not the end of the document data (S76: NO), the CPU 11 returns the processing to step S72 and continues with the next sentence unit. If the CPU 11 judges that it is the end of the document data (S76: YES), the CPU 11 judges whether the processing of recalculating the weight value of each word of the weighted word group and storing it in association by the salience attribute has been completed for all document data (step S77).
[0289] If the CPU 11 judges that this processing has not yet been completed for all document data (S77: NO), the CPU 11 returns the processing to step S71, acquires other document data, and continues. If the CPU 11 judges that the processing has been completed for all document data (S77: YES), the CPU 11 ends the processing.
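Purely as an illustration of the loop of Fig. 24, the rewriting of the salience attributes might look like the following Python sketch. The specification names only the <su> tag and the salience attribute; the "word:weight" serialization, the exact tag shape and all function names here are assumptions for the example.

```python
import re

# Assumed tag shape: <su salience="word:weight word:weight ...">
SU_RE = re.compile(r'<su salience="([^"]*)">')

def parse_salience(attr: str) -> dict[str, float]:
    """Parse 'word:weight word:weight ...' (assumed format, no spaces in words)."""
    return {w: float(v) for w, v in (item.rsplit(":", 1) for item in attr.split())}

def serialize_salience(weights: dict[str, float]) -> str:
    return " ".join(f"{w}:{v:.4f}" for w, v in weights.items())

def rewrite_document(doc: str, recalc) -> str:
    """Steps S72 to S76: recompute the salience attribute of every <su> tag."""
    def repl(m: re.Match) -> str:
        new = serialize_salience(recalc(parse_salience(m.group(1))))
        return f'<su salience="{new}">'
    return SU_RE.sub(repl, doc)
```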
[0290] The CPU 11 of the sentence unit search device 1 realizes the recalculation of each word's weight value in step S74 by performing the following processing.
[0291] Fig. 25 is a flowchart showing the details of the processing procedure by which the CPU 11 of the sentence unit search device 1 according to Embodiment 3 recalculates the weight value of each word of the weighted word group stored in association with each sentence unit. The processing shown in the flowchart of Fig. 25 corresponds to the processing of multiplying the weight values of the weighted word group by each word's degree of relevance, and of reassigning each word's weight value based on the multiplied values.
[0292] The CPU 11 of the sentence unit search device 1 reads out each word of the weighted word group stored in association by the salience attribute read in step S74 of the flowchart of Fig. 24, together with each word's reference probability, and stores them in the temporary storage area 14 (step S81). The CPU 11 selects one of the words (step S82) and performs the following processing for the weight value of the selected word.
[0293] The CPU 11 reads out the related word groups, in which a degree of relevance is assigned to each word, stored in the storage means 13 or the document storage means 2 (step S83). From the related word group read out for each word, the CPU 11 acquires the degree of relevance from that word to the selected word (step S84). The CPU 11 multiplies the reference probability of each word stored in the temporary storage area 14 by the acquired degree of relevance from that word to the selected word, and calculates the sum (step S85).
[0294] The sum calculated by the CPU 11 in step S85 is, for the selected word, the weight value representing its salience, recalculated with the association by related words taken into account.
[0295] The CPU 11 judges whether the weight values have been recalculated for all the words stored in the temporary storage area 14 in step S81 (step S86). If the CPU 11 judges that the weight values have not yet been recalculated for all the words (S86: NO), it returns the processing to step S82 and executes the recalculation of steps S82 to S85 for the next word. If the CPU 11 judges that the weight values have been recalculated for all the words (S86: YES), it returns the processing to step S75 of the flowchart of Fig. 24.
[0296] The processing of recalculating the weight values by the CPU 11 of the sentence unit search device 1, shown in step S74 of the flowchart of Fig. 24 and in the flowchart of Fig. 25, may also be executed within the processing of Embodiment 1 that calculates the reference probabilities and stores them as weight values representing the salience for each sentence unit. Specifically, a configuration is possible in which the processing of step S74 and of the flowchart of Fig. 25 is executed between steps S306 and S307 of the processing procedure shown in the flowchart of Fig. 9.
[0297] A concrete example is given below of the processing, in the procedure shown in the flowcharts of Figs. 24 and 25, by which the CPU 11 of the sentence unit search device 1 recalculates the reference probability calculated for each word into a weight value that takes association into account.
[0298] For example, when the relevance group created for the word "Amerikamura" is used, the sentence unit search device 1 recalculates the weight value representing the salience of "Osaka" in a certain sentence unit as follows. Assume that the degree of relevance to "Osaka" in the relevance group created for "Amerikamura" is 0.3. Even when the words stored in association with the sentence unit include "Amerikamura" with a reference probability of 0.4 but do not include "Osaka", the CPU 11 of the sentence unit search device 1 multiplies the reference probability 0.4 of "Amerikamura" by the degree of relevance 0.3 from "Amerikamura" to "Osaka", and recalculates the weight value of "Osaka" in that sentence unit as 0.12 instead of 0.
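The per-word recalculation of steps S82 to S85 (equation (6) below) might be sketched in Python as follows; the nested-dictionary relevance table and the self-relevance value of 1.0 are illustrative assumptions, chosen so that the sketch reproduces the 0.4 × 0.3 = 0.12 example above.

```python
def recalc_salience(probs: dict[str, float],
                    relevance: dict[str, dict[str, float]],
                    vocabulary: list[str]) -> dict[str, float]:
    """Equation (6): salience(w_k) = sum over j of b_{j,k} * Pr(w_j)."""
    return {wk: sum(relevance.get(wj, {}).get(wk, 0.0) * p
                    for wj, p in probs.items())
            for wk in vocabulary}

# relevance[wj][wk] holds b_{j,k}; the self-relevance of 1.0 is assumed here.
relevance = {"Amerikamura": {"Amerikamura": 1.0, "Osaka": 0.3}}
probs = {"Amerikamura": 0.4}   # "Osaka" itself does not appear
print(recalc_salience(probs, relevance, ["Amerikamura", "Osaka"]))
# Prints Amerikamura: 0.4 and Osaka: approximately 0.12, as in the example.
```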
[0299] Here, the weight value representing the salience, with contextual association taken into account, of a word w_k in each sentence s_i is written salience(w_k | pre(s_i)), and the reference probability of the word w_k in each sentence s_i is written Pr(w_k | pre(s_i)). When the degree of relevance of a word w_j to the word w_k is reflected, the weight value is recalculated as salience(w_k | pre(s_i)) = b_{j,k} × Pr(w_j | pre(s_i)). Since other words also have degrees of relevance to the word w_k, the sentence unit search device 1 recalculates each word's weight value as in the following equation (6), so that the influence of the degrees of relevance from all the words w_j (j = 1, ..., N) is also reflected.
[0300] [Equation 6]

$$\mathrm{salience}(w_k \mid \mathrm{pre}(s_i)) = \sum_{j=1}^{N} b_{j,k} \times \Pr(w_j \mid \mathrm{pre}(s_i)) \qquad (6)$$
[0301] Accordingly, the CPU 11 of the sentence unit search device 1 recalculates the weight value of each word w_k (k = 1, ..., N) in a sentence unit s_i as in the following equation (7).
[0302] [Equation 7]

$$V(s_i) = \bigl(\mathrm{salience}(w_1 \mid \mathrm{pre}(s_i)),\ \ldots,\ \mathrm{salience}(w_k \mid \mathrm{pre}(s_i)),\ \ldots,\ \mathrm{salience}(w_N \mid \mathrm{pre}(s_i))\bigr) = (bw_1, \ldots, bw_N)\, v(s_i) \qquad (7)$$

where v(s_i) is the vector whose elements are the reference probabilities Pr(w_j | pre(s_i)) and the relevance vectors bw_1, ..., bw_N are arranged as the columns of the transformation matrix.
[0303] The expression in the last line of equation (7) represents the principle by which, when the weighted word group, i.e. the pairs of words and their reference probabilities, is expressed as the salience vector v(s_i) as shown in Embodiment 2, each word's weight value is calculated in the association-weighted salience vector V(s_i), which has salience(w_k | pre(s_i)) as its k-th element.
[0304] Here, bw_1, ..., bw_N are the relevance vectors that express, as vectors over all the words w_1, ..., w_N, the corresponding related word groups.
[0305] When the weighted word group, i.e. the pairs of words and their reference probabilities, is expressed as the multidimensional vector v(s_i), and the related word groups are expressed as the relevance vectors bw_1, ..., bw_N, the processing of recalculating each word's reference probability into an association-weighted weight value as in equation (7) can be interpreted as follows.
[0306] The association-weighted salience vector V(s_i), which has salience(w_k | pre(s_i)) as its k-th element, can be interpreted as the salience vector v(s_i) viewed in the oblique coordinate system whose basis is the relevance vectors bw_1, ..., bw_N. In other words, the association-weighted salience vector V(s_i) can be interpreted as the salience vector v(s_i), whose elements are the raw reference probabilities, rotated toward the axes of related words.
[0307] The oblique coordinate system based on the relevance vectors bw_1, ..., bw_N is a coordinate system in which, with each association-weighted word taken as one dimension, the basis vectors (the vectors of magnitude 1 in each word's dimension direction) are not mutually orthogonal, and the angle between the basis vectors of highly related words is small.
[0308] Multiplying the salience vector whose elements are the reference probabilities by the transformation matrix whose elements are b_{j,k} can thus be interpreted as yielding the salience vector V(s_i) rotated toward the dimension directions of the related words.
[0309] Therefore, when the weighted word group representing the semantic coherence of each sentence is expressed and stored as a salience vector, the CPU 11 of the sentence unit search device 1 can, by rotating (transforming) that salience vector with the relevance vectors, express and store the semantic coherence of each sentence as an association-weighted salience vector.
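In matrix form, this rotation can be sketched with NumPy as follows, assuming an N × N matrix B whose rows are the relevance vectors bw_j, so that the association-weighted vector is Bᵀ v; the numerical values are illustrative only.

```python
import numpy as np

# Rows of B are the relevance vectors bw_j = (b_{j,1}, ..., b_{j,N}).
B = np.array([[1.0, 0.3],   # bw for "Amerikamura" (illustrative values)
              [0.3, 1.0]])  # bw for "Osaka"
v = np.array([0.4, 0.0])    # reference probabilities of one sentence unit

V = B.T @ v                 # k-th entry: sum over j of b_{j,k} * Pr(w_j)
print(V)                    # -> [0.4  0.12], matching the worked example
```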
[0310] Next, a concrete example is shown of the result of recalculating, with association taken into account and using the relevance groups that quantitatively express degrees of relevance as described above, the weight value of each word representing the semantic coherence of each sentence unit. Fig. 26 is an explanatory diagram showing example contents of the weight values representing the salience of each word calculated by the CPU 11 of the sentence unit search device 1 according to Embodiment 3. The weight values of each word for the sentences s_1 and s_2 shown in Fig. 26(a) are the reference probability values before association is applied using the related word groups. The weight values of each word for the sentences s_1 and s_2 shown in Fig. 26(b) are the weight values after association has been applied using the related word groups.
[0311] The concrete example shown in Fig. 26 uses sentence units extracted from the Corpus of Spontaneous Japanese (http://www.kokken.go.jp/katsudo/kenkyujyo/corpus/CSJ/vol17/D03F0040).
[0312] As the example of Fig. 26 shows, the weight value of "Osaka" in sentence s_1 of Fig. 26(b) is 0.6229, higher than the reference probability value 0.3338 of "Osaka" in sentence s_1 of Fig. 26(a). Likewise, the weight value of "Osaka" in sentence s_2 of Fig. 26(b) is 0.6675, higher still than the reference probability value 0.3208 in sentence s_2 of Fig. 26(a).
[0313] Furthermore, in the reference probability example of Fig. 26(a), the weight value of "Osaka" in sentence s_2 has dropped even though "Amerikamura" appears in s_2, because the influence (excitation) of "Amerikamura" on the weight value of "Osaka" is not considered. In contrast, in the association-weighted example of Fig. 26(b), the weight value representing the salience of "Osaka", which does not itself appear, is raised in sentence s_2 precisely because "Amerikamura" appears there; the influence of the degree of relevance between "Amerikamura" and "Osaka" is reflected.
[0314] In this way, by applying association to the weighted word groups that the sentence unit search device 1 stores for each sentence unit, through related word groups that express degrees of relevance with the quantitative values of reference probabilities, the salience of "Osaka" when "Amerikamura" is in focus in a sentence unit can be brought closer to the background context of the writer or speaker of the sentence unit or words. This avoids the situation in which the weight value representing the salience of the word "Osaka" is calculated low and the semantic coherence of the sentence unit is quantitatively evaluated as if it were detached from the writer's or speaker's actual context.
[0315] 4. Search processing
Next, the search processing in Embodiment 3 is described. As for "4-1. Accepting words input by the user", the processing performed by the CPU 41 of the reception device 4 is the same as in Embodiments 1 and 2, and a detailed description is omitted.
[0316] 4-2'. Quantifying, with association, the semantic coherence of accepted words
Next, the processing is described in which the CPU 11 of the sentence unit search device 1, when it receives the data of words accepted by the reception devices 4, 4, ..., searches the sentences in the documents stored in the document storage means 2. For the accepted words as well, the semantic coherence is quantified: words are extracted from the text data and their reference probabilities are calculated, and the weight values are then recalculated using the degrees of relevance.
[0317] In Embodiment 3, the CPU 11 of the sentence unit search device 1 applies association by related words to the weighted word group, i.e. the pairs of words and reference probabilities or the salience vector, that quantitatively represents the semantic coherence of the accepted words. The processing by which the CPU 11 recalculates, with association, the weight values of the weighted word group associated with the accepted words and executes the search based on the recalculated weight values is described below.
[0318] Fig. 27 is a flowchart showing the processing procedure of the search processing of the sentence unit search device 1 and the reception device 4 in Embodiment 3. In the processing procedure shown in the flowchart of Fig. 27, the same reference symbols are used for the steps identical to the search processing shown in the flowcharts of Figs. 15, 16 and 17 in Embodiment 1, and detailed descriptions of them are omitted.
[0319] In the processing procedure shown in the flowchart of Fig. 27, the processing of step S4001, enclosed by the two-dot chain line, differs from the procedure shown in the flowcharts of Figs. 15, 16 and 17 in Embodiment 1. That is, the difference is that step S4001, described below, is added between step S411 and step S412.
[0320] The search processing of Embodiment 3, which associates with accepted words a weighted word group representing their semantic coherence and extracts the sentence units whose pre-stored semantic coherence is similar, is described below.
[0321] The CPU 11 narrows all the words whose reference probabilities have been calculated and stored in the temporary storage area 14 down to the words whose calculated reference probability is at or above a predetermined value (step S411), and recalculates the reference probabilities calculated in step S408 into association-weighted weight values (step S4001). The recalculation of the association-weighted weight values by the CPU 11 in step S4001 is performed, as in the processing shown in the flowchart of Fig. 25, by selecting the words one at a time and, for the selected word, multiplying each word's reference probability by its degree of relevance to the selected word and summing.
[0322] Through the processing up to this point, a set of words and reference probabilities (a weighted word group) that quantitatively represents, with association taken into account, the semantic coherence of the accepted words in the flow continuing from previously accepted words has been generated as a search request.
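On the query side, steps S411 and S4001 might chain as in the short Python sketch below, reusing the recalc_salience helper sketched earlier; the names and the threshold value are assumptions for the example.

```python
def build_search_request(query_probs: dict[str, float],
                         relevance: dict[str, dict[str, float]],
                         vocabulary: list[str],
                         threshold: float = 0.2) -> dict[str, float]:
    """S411: drop low-probability words; S4001: apply association."""
    kept = {w: p for w, p in query_probs.items() if p >= threshold}
    return recalc_salience(kept, relevance, vocabulary)
```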
[0323] The CPU 11 thereafter reads out the association-weighted word groups stored in association with each sentence and executes the processing of extracting sentences similar to the association-weighted word group obtained in step S4001. The subsequent processing for the association-weighted word group is the same as in Embodiment 1, and a detailed description is omitted.
[0324] Thereby, the sentence unit search device 1 judges whether the semantic coherence, with association applied via related words, is similar between the sentences separated from the document data stored in the document storage means 2 and the accepted words, and can directly output the sentences judged to be similar. Therefore, by implementing the sentence unit search method of the present invention, sentence units whose contextual semantic coherence is similar can be effectively extracted, with association taken into account, and directly output.
[0325] When the CPU 11 of the sentence unit search device 1 associates a weighted word group with the accepted words and judges whether it is similar to the weighted word groups stored in advance for each sentence, the judgment is not necessarily made, as in the processing procedure shown in the flowchart of Fig. 27, according to whether the weighted word groups contain the same words. Nor is it necessarily made by calculating the differences between the weight values assigned to the same words and judging that the smaller the calculated difference, the more similar the groups are.
[0326] Next, the case is described in which the CPU 11 of the sentence unit search device 1 realizes the processing of extracting sentence units whose semantic coherence is similar to the accepted words by expressing semantic coherence with salience vectors and relevance vectors and calculating the distance between vectors.
[0327] Fig. 28 is a flowchart showing the processing procedure of the search processing of the sentence unit search device 1 and the reception device 4 when the vector representation of Embodiment 3 is used. In the processing procedure shown in the flowchart of Fig. 28, the same reference symbols are used for the steps identical to the search processing shown in the flowcharts of Figs. 15, 16 and 17 in Embodiment 1 and in the flowchart of Fig. 19 in Embodiment 2, and detailed descriptions of them are omitted.
[0328] In the processing procedure shown in the flowchart of Fig. 28, the processing from step S501 to step S506 enclosed by the one-dot chain line differs from the procedure shown in the flowcharts of Figs. 15, 16 and 17 in Embodiment 1. Instead of the processing from step S412 to step S416 in Embodiment 1, processing similar to that from step S501 to step S506 executed by the CPU 11 of the sentence unit search device 1 in Embodiment 2 is performed. Furthermore, the processing of step S5001, enclosed by the two-dot chain line, differs from the procedure shown in the flowchart of Fig. 19 in Embodiment 2. That is, the difference is that step S5001, described below, is added between step S501 and step S502.
[0329] The CPU 11 of the sentence unit search device 1 recalculates the salience vector calculated in step S501 into a salience vector with association by related words applied (step S5001).
[0330] The CPU 11 thereafter reads out the association-weighted salience vectors stored in association with each sentence and executes the processing of extracting sentences similar to the association-weighted vector obtained in step S5001. The processing of reading out the association-weighted salience vectors and extracting similar sentences is the same as in Embodiment 2, and a detailed description is omitted.
[0331] In step S5001, the CPU 11 recalculates the salience vector into a salience vector with association by related words applied by transforming (rotating) the salience vector calculated in step S501 with the group (matrix) of relevance vectors, as shown in equation (7). Specifically, the association-weighted salience vector V described above is calculated from the multidimensional vector v whose elements are only the reference probabilities.
[0332] In the processing procedure shown in the flowchart of Fig. 28 described above, the processing of step S503, in which the CPU 11 calculates the distance between the salience vector associated with the accepted words and a read-out salience vector, is concretely performed in Embodiment 3 as follows. When the salience vector recalculated with association for the accepted words u_i is written V(u_i), and the read-out salience vector to which association has been applied in advance is written V(s_i), the CPU 11 calculates the cosine distance as in the following equation (8).
[0333] [Equation 8]

$$\frac{V(s_i) \cdot V(u_i)}{\|V(s_i)\|\,\|V(u_i)\|} = \frac{\displaystyle\sum_{k=1}^{N} \mathrm{salience}(w_k \mid s_i)\,\mathrm{salience}(w_k \mid u_i)}{\sqrt{\displaystyle\sum_{k=1}^{N} \mathrm{salience}(w_k \mid s_i)^2}\;\sqrt{\displaystyle\sum_{k=1}^{N} \mathrm{salience}(w_k \mid u_i)^2}} \qquad (8)$$
[0334] When the distance is calculated as in equation (8), the closer the salience vector V(u_i) of the words and the read-out salience vector V(s_i) are, the larger the calculated cosine value becomes. Therefore, in step S506, the CPU 11 assigns degrees of similarity in descending order of the calculated cosine distance.
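The cosine measure of equation (8) and the descending-order ranking of step S506 might be sketched in Python as follows; the dictionary-based vectors and the "salience" key are assumptions for the example.

```python
import math

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Equation (8): cosine between two association-weighted salience vectors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_sentence_units(query: dict[str, float], units: list[dict]) -> list[dict]:
    """Step S506: a larger cosine means more similar, so sort in descending order."""
    return sorted(units, key=lambda u: cosine(query, u["salience"]), reverse=True)
```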
[0335] Through the processing by the CPU 11 of the sentence unit search device 1 described above, sentence units with similar semantic coherence can be searched for directly by the distance between association-weighted salience vectors representing semantic coherence. By using the vector representation, the CPU 11 can judge directly, with association applied, whether the accepted words and a sentence are similar, without comparing, one word at a time, the weight values of the association-weighted word group associated with the accepted words and those of the association-weighted word group stored in advance in association with the sentence.
[0336] Furthermore, with the sentence unit search device 1 according to Embodiment 3, the salience vectors associated with each sentence unit and with words are handled in an oblique coordinate system in which the dimensions corresponding to the words are not mutually orthogonal and the angle between the dimension directions of highly related words is small. Consequently, when the distances between vectors are compared to judge similarity, vectors that have components in the dimension directions of highly related words are judged to be similar.
[0337] Therefore, when a sentence unit s in which the salience of "Osaka" is high is stored, and in the accepted words the salience of, for example, "Orandamura" ("Holland Village") is high, the sentence unit s is not judged to be similar to the accepted words. When, however, the salience of "Amerikamura" in the accepted words is high, the salience of "Osaka" in the accepted words is excited and raised, so the sentence unit s is more likely to be judged similar to those accepted words.
[0338] This makes it possible to search for, and directly output, sentence units whose semantic coherence is similar to the accepted words more effectively, with association taken into account.
[0339] In Embodiments 1 to 3, the text data received as search results is displayed on the monitor or the like of the display means 46 provided in the reception device 4; however, the received text data may also be converted into speech and output through the speaker or the like of the voice input/output means 47. This allows the user to obtain, as search results, sentences whose semantic coherence is similar to the context of a conversation, either from multiple words the user has input by voice or by voice-inputting a conversation with another user. When the accepted words consist of spoken language, sentences with similar word salience, including words that are omitted in the utterance and represented by zero pronouns, can be obtained directly as search results.
[0340] Each time it receives text data of words, the CPU 11 of the sentence unit search device 1 may also be configured to transmit to the reception devices 4, 4, ... only the text data representing the sentence with the highest priority among the sentences retrieved for that text data. This makes it possible to present the search result for the input words as the utterance of a third party in the conversation, realizing a three-way conversation.
[0341] In Embodiments 1 to 3, the sentence unit search device 1 specifies and stores the information indicating salience for each sentence; however, a configuration is also possible in which each paragraph consisting of multiple sentences is enclosed by the tags <p> </p>, the feature patterns are specified for that paragraph, the information indicating salience is stored by the salience attribute, and paragraphs are output as search results. The unit is not limited to sentences or paragraphs and may be a clause, as long as it is a unit representing a certain coherence of meaning. In spoken language, the character string identifiable as one sentence can become very long: it may consist of many clauses, and although clause follows clause via connective particles such as "-mo" and "-node", the context changes dynamically, so that a single sentence may not form one coherent unit of meaning. Therefore, for a sentence consisting of more than a predetermined number of clauses, a configuration may be adopted in which each clause is treated as one sentence for processing.
[0342] In Embodiments 1 to 3, document data consisting of spoken language is stored in advance separately from document data consisting of written language; however, a configuration is also possible in which received words are stored in the document storage means 2 each time the feature pattern of each word is specified and the reference probability calculated for them. In this case, the CPU 11 of the sentence unit search device 1 can judge whether consecutively received words form one series based on information identifying the reception device 4 that transmitted the words and on information indicating that the reception device 4 has detected the user's search start and finish operations. This allows words to be stored in the document storage means 2 in units corresponding to the pages of the document data stored in advance in the document storage means 2.
[0343] In Embodiments 1 to 3, the sentence unit search device 1 performs everything: the acquisition and tagging of document data, the regression analysis for obtaining the reference probabilities, and the processing when words are accepted. However, a configuration divided into a sentence unit search device and a document storage device is also possible. In this case, the document storage device performs Web crawling to acquire document data, adds tags to the text data by morphological analysis and syntactic analysis, and stores it. In addition, a formula for calculating the reference probability is obtained by regression analysis based on the document data stored in the document storage device, and using the obtained formula, the processing of storing the words of each sentence and their reference probabilities for the stored document data is performed in advance. The sentence unit search device specifies the feature patterns when it receives text data converted from words, acquires the regression formula for calculating the reference probabilities from the document storage device, calculates the reference probabilities, and performs the search.
[0344] In Embodiments 1 to 3, the input of words from the user, such as character string input or voice input, is converted into text data by the reception device 4 and transmitted to the sentence unit search device 1. The invention is not limited to this: the sentence unit search device 1 may itself include input/output means for accepting the user's character string input operations and voice input means for accepting the user's voice input. Fig. 29 is a block diagram showing the configuration when the sentence unit search method of the present invention is implemented by the sentence unit search device 1 alone. In this case, the sentence unit search device 1 further includes, in addition to the CPU 11, the internal bus 12, the storage means 13, the temporary storage area 14, the document set connection means 16 and the auxiliary storage means 17, operation means 145 such as a mouse or keyboard for accepting user operations, display means 146 such as a monitor, and voice input/output means 147 such as a microphone and speaker.
[0345] With the configuration shown in the block diagram of Fig. 29, the CPU 11 of the sentence unit search device 1 can detect the frequency, speaking rate and other features of the voice input from the voice input means and specify the feature pattern of each word in the utterance. The grammatical feature pattern of each word may be obtained by converting the input voice into text data by speech recognition and performing the search based on that text data.
[0346] In Embodiments 1 to 3, the reception devices 4, 4, ... are configured only as devices that divide the accepted character strings or spoken words into fixed lengths, convert them into digital data, and transmit them. However, in order to implement the sentence unit search method of the present invention, the programs stored in the storage means 43 of the reception devices 4, 4, ... may be configured so that the reception devices 4, 4, ... can execute natural language analysis, such as morphological and syntactic analysis or phoneme analysis, on the accepted words. In this case, the CPU 41 of the reception devices 4, 4, ... may calculate the weight values representing the salience of each word in the accepted words and transmit the calculated weighted word group to the sentence unit search device 1 as the search request.
Industrial Applicability
[0347] By causing a computer device capable of speech recognition of conversations between users to implement the sentence unit search method according to the present invention, the method can also be applied to uses in which the computer device participates in a conversation between users to realize a three-way conversation. It is also applicable to uses realizing a conversation-linked advertisement presentation service that switches according to the flow of context of a conversation or chat between users. Application to a meeting support service that presents similar, related minutes from past minutes according to the flow of context during a meeting is also possible. Furthermore, application to a writing support service that accepts text being written as words and provides related information according to the flow of context is also possible.

Claims

[1] A sentence unit search method in which a document set storing a plurality of document data consisting of natural language is used, document data acquired from the document set is separated in advance into sentence units each consisting of one or more sentences, words are accepted, and the sentence units separated from the document set are searched based on the accepted words, the method comprising:
a step of storing in advance, in association with each of the successive sentence units in the document data, a weighted word group consisting of a plurality of words to which weight values in that sentence unit are assigned;
a step of, when words are accepted, associating with the words a weighted word group consisting of a plurality of words to which weight values in the words are assigned;
a similar sentence unit extraction step of extracting, from the document set, sentence units recorded in association with a weighted word group similar to the weighted word group associated with the accepted words; and
a step of outputting the extracted sentence units.
[2] The sentence unit search method according to claim 1, wherein the similar sentence unit extraction step comprises:
a step of judging whether the distribution of the weight values of the plurality of words in the weighted word group associated with the accepted words and the distribution of the weight values of the plurality of words in a weighted word group associated with a previously separated sentence unit satisfy a predetermined condition; and
a step of extracting the sentence units associated with a weighted word group judged to satisfy the predetermined condition.
[3] The sentence unit search method according to claim 1 or 2, wherein the similar sentence unit extraction step includes:
a step of extracting, from the previously sorted sentence units, sentence units associated with a word group containing the same words as the weighted word group associated with the accepted words;
a step of calculating, between the accepted words and each extracted sentence unit, the difference in weight value for each identical word in the associated word groups; and
a step of assigning priorities to the extracted sentence units in ascending order of the calculated difference,
and wherein the extracted sentence units are output on the basis of the priorities.
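A sketch of the ranking recited in claim 3, under the assumption that candidate units share at least one word with the query and that per-word weight differences are summed; all names and values are hypothetical:

```python
# Hypothetical sketch of claim 3: filter to units sharing words with the query,
# then rank by the summed per-word weight difference (smaller = higher priority).

def weight_difference(query: dict, unit_weights: dict) -> float:
    shared = set(query) & set(unit_weights)
    return sum(abs(query[w] - unit_weights[w]) for w in shared)

def rank_by_difference(query: dict, units: list) -> list:
    candidates = [u for u in units if set(query) & set(u[1])]  # same-word filter
    return sorted(candidates, key=lambda u: weight_difference(query, u[1]))

units = [("unit on the economy", {"economy": 0.7, "policy": 0.2}),
         ("unit on sports", {"sports": 0.9})]
print(rank_by_difference({"economy": 0.6}, units))  # only the economy unit qualifies
```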
[4] The sentence unit search method according to claim 1 or 2, comprising a step of calculating each weighted word group as a multidimensional vector that treats each word as one dimension and has the weight value assigned to each word as the element in the dimension corresponding to that word,
wherein the similar sentence unit extraction step includes:
a step of calculating the distance between the multidimensional vector stored for each sorted sentence unit and the multidimensional vector associated with the accepted words; and
a step of assigning priorities to the sentence units in ascending order of the calculated distance,
and wherein the sentence units are output according to the assigned priorities.
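The vector formulation of claim 4 can be sketched as follows; the fixed vocabulary and the choice of Euclidean distance are assumptions of the sketch (the claim requires only a distance between multidimensional vectors):

```python
# Hypothetical sketch of claim 4: one dimension per word, weight values as
# vector elements, ranking by distance to the query vector (shortest first).
import math

vocabulary = ["president", "economy", "country", "measure"]  # hypothetical

def to_vector(weights: dict) -> list:
    return [weights.get(w, 0.0) for w in vocabulary]

def euclidean(u: list, v: list) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

query_vec = to_vector({"president": 0.9, "economy": 0.4})
unit_vec = to_vector({"president": 0.8, "economy": 0.6})
print(euclidean(query_vec, unit_vec))  # smaller distance = higher priority
```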
[5] The sentence unit search method according to any one of claims 1 to 4, comprising, when associating a weighted word group with a sentence unit or with accepted words, a reference probability calculating step of calculating, for each word, a reference probability that the word appears or is referenced in a sentence unit or words subsequent to that sentence unit or those words,
wherein the calculated reference probability is assigned as the weight value of each word.
[6] The sentence unit search method according to claim 5, wherein the reference probability calculating step includes:
a step of identifying a feature pattern including a pattern in which each word appears in a plurality of sentence units including a preceding sentence unit, or a pattern in which the word is referenced from a preceding sentence unit; and
a step of calculating the ratio at which words for which the same feature pattern as said feature pattern is identified appear or are referenced in subsequent sentence units in the document data acquired from the document set,
and wherein the calculated ratio is used as the reference probability.
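A sketch of claim 6, where the reference probability of a feature pattern is the observed ratio with which words bearing that pattern reappear (or are referenced) in subsequent sentence units; the pattern encoding and counts below are hypothetical:

```python
# Hypothetical sketch of claim 6: empirical reference probability per pattern.
from collections import defaultdict

pattern_counts = defaultdict(lambda: [0, 0])  # pattern -> [reappeared, total]

# (feature_pattern, reappeared_in_subsequent_unit) observations, toy values
observations = [
    (("subject", "noun", 1), True),
    (("subject", "noun", 1), False),
    (("object", "noun", 3), False),
]
for pattern, reappeared in observations:
    pattern_counts[pattern][1] += 1
    if reappeared:
        pattern_counts[pattern][0] += 1

def reference_probability(pattern) -> float:
    hit, total = pattern_counts.get(pattern, (0, 0))
    return hit / total if total else 0.0

print(reference_probability(("subject", "noun", 1)))  # 0.5
```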
[7] The sentence unit search method according to claim 5, comprising:
an identification step of identifying, for each word extracted from the document set, the feature pattern of the word;
a determination step of determining whether a word for which the same feature pattern as the identified feature pattern is identified appeared or was referenced in a subsequent sentence unit in the document data; and
a regression step of performing a regression analysis between the identified feature patterns and the results determined for the words identified by those feature patterns, thereby calculating regression coefficients of the feature patterns with respect to the reference probability,
wherein, when storing a weighted word group in association with a sentence unit or when associating a weighted word group with accepted words, the reference probability calculating step identifies, for each sentence unit or each set of accepted words, the feature pattern of each word in that sentence unit or those words, and calculates the reference probability using the regression coefficient for the identified feature pattern.
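Claim 7 requires only that a regression analysis relate feature patterns to the observed outcome; as one plausible instance, here is a sketch using logistic regression trained by stochastic gradient ascent (the feature encoding, learning rate, and data are all assumptions of this sketch):

```python
# Hypothetical sketch of claim 7: learn regression coefficients from
# (feature pattern, reappeared-later) samples, then score new words.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train(samples: list, n_features: int, lr: float = 0.5, epochs: int = 200) -> list:
    """samples: list of (feature_vector, outcome 0/1)."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in samples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

# toy encoding of a feature pattern: [is_subject, is_topic, recency]
samples = [([1, 1, 1.0], 1), ([0, 1, 0.5], 1), ([0, 0, 0.2], 0), ([1, 0, 0.8], 0)]
coef = train(samples, 3)
# reference probability for a newly observed feature pattern
print(sigmoid(sum(c * x for c, x in zip(coef, [1, 1, 0.9]))))
```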
[8] The sentence unit search method according to claim 6, wherein, for sentence units, the ratio is calculated in document data acquired from a first document set consisting of written language, and, for accepted words, the ratio is calculated in document data acquired from a second document set consisting of spoken language.
[9] The sentence unit search method according to claim 7, wherein the identification step, the determination step, and the regression step are executed in advance for each of a first document set consisting of written language and a second document set consisting of spoken language, and
the reference probability calculating step calculates the reference probability for the feature pattern of a word identified in a sentence unit using the regression coefficients calculated by the regression step executed on the first document set, and calculates the reference probability for the feature pattern of a word identified in the accepted words using the regression coefficients calculated by the regression step executed on the second document set.
[10] The sentence unit search method according to any one of claims 6 to 9, wherein the feature pattern is identified by information including one or more of:
the number of sentence units or sets of words from the preceding sentence unit or words, when the word is referenced from a preceding sentence unit or words, to the sentence unit or words containing the word;
the dependency information of the word in the most recent preceding sentence unit or words in which the word appears or is referenced;
the number of times the word has appeared or been referenced up to the sentence unit or words containing the word;
the noun classification of the word in the most recent preceding sentence unit or words in which the word appears or is referenced;
whether the word is the topic in the most recent preceding sentence unit or words in which the word appears or is referenced;
whether the word is the subject in the most recent preceding sentence unit or words in which the word appears or is referenced;
the grammatical person in the sentence unit or words containing the word; and
the part-of-speech information in the sentence unit or words containing the word.
[11] The sentence unit search method according to any one of claims 6 to 10, wherein the feature pattern is identified by information including one or more of:
the time corresponding to the span from the preceding sentence unit or words, when the word is referenced from a preceding sentence unit or words, to the sentence unit or words containing the word;
the utterance speed corresponding to the word in the most recent preceding sentence unit or words in which the word appears or is referenced; and
the frequency of the speech corresponding to the word in the most recent preceding sentence unit or words in which the word appears or is referenced.
[12] The sentence unit search method according to any one of claims 1 to 11, comprising:
a first step of extracting, for one word among the words extracted from the document set, from the weighted word groups associated with the sorted sentence units, word groups that contain the one word and in which the weight value of the one word is equal to or greater than a predetermined value;
a second step of creating a related word group in which a value obtained by integrating, word by word, the weight values of the words in the word groups extracted in the first step is assigned as the degree of relevance of the one word to each word;
a third step of storing the created related word group in association with the one word;
a step of executing the first to third steps in advance for each of the extracted words; and
a relevance adding step of reassigning the weight value of each word of the weighted word group associated with each sentence unit or each set of accepted words, using the degrees of relevance of the words in the related word group stored in association with each word.
[13] The sentence unit search method according to claim 12, wherein the second step includes:
a step of calculating, for the extracted word groups, a sum of the weight values of the words contained in each word group, weighted by the weight value of the one word;
a step of averaging the calculated sums; and
a step of assigning the averaged sum of the weight values of each word as the degree of relevance of that word in the related word group to be created.
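Claims 12 and 13 together can be sketched as follows: collect the weighted word groups in which the target word is strongly weighted, sum each co-occurring word's weight scaled by the target's own weight, and average. The threshold and data are hypothetical:

```python
# Hypothetical sketch of claims 12-13: building a related word group.
from collections import defaultdict

def related_word_group(target: str, unit_groups: list, threshold: float = 0.5) -> dict:
    selected = [g for g in unit_groups if g.get(target, 0.0) >= threshold]
    sums = defaultdict(float)
    for g in selected:
        for word, weight in g.items():
            sums[word] += g[target] * weight  # weighted by the target's own weight
    n = len(selected)
    return {w: s / n for w, s in sums.items()} if n else {}

groups = [{"economy": 0.8, "policy": 0.6},
          {"economy": 0.6, "trade": 0.4},
          {"economy": 0.2, "sports": 0.9}]  # last one falls below the threshold
print(related_word_group("economy", groups))
```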
[14] The sentence unit search method according to claim 12 or 13, wherein the relevance adding step includes, for each word of the weighted word group associated with each sentence unit or each set of accepted words:
a step of multiplying the weight value of each word of the weighted word group by the degree of relevance of each word included in the related word group stored in association with that word; and
a step of reassigning the weight value of each word of the weighted word group on the basis of the multiplication results.
[15] The sentence unit search method according to any one of claims 12 to 14, comprising a step of calculating the related word group for each word as a multidimensional relevance vector that treats each word as one dimension and has the magnitude of the degree of relevance assigned to each word as the element in the dimension corresponding to that word,
wherein the relevance adding step transforms the multidimensional vector stored for each sorted sentence unit by the matrix whose columns are the relevance vectors of the words.
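The matrix formulation of claim 15 amounts to multiplying each sentence-unit vector by a matrix whose columns are the per-word relevance vectors, so weight placed on a word spreads onto its related words. The vocabulary and matrix values below are hypothetical:

```python
# Hypothetical sketch of claim 15: transform a unit vector by the matrix of
# relevance-vector columns.
vocab = ["economy", "policy", "trade"]
# relevance_matrix[i][j]: relevance of vocab[i] within the relevance vector
# of vocab[j] (column j is the relevance vector of word j)
relevance_matrix = [
    [1.0, 0.7, 0.5],
    [0.7, 1.0, 0.2],
    [0.5, 0.2, 1.0],
]

def transform(unit_vector: list) -> list:
    n = len(vocab)
    return [sum(relevance_matrix[i][j] * unit_vector[j] for j in range(n))
            for i in range(n)]

print(transform([0.8, 0.0, 0.3]))  # weight on "economy" spreads to related words
```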
[16] A sentence unit search method that uses a document set in which a plurality of document data composed of natural language are stored, accepts words, and searches the document set on the basis of the accepted words, the method comprising:
a step of sorting document data obtained from the document set into sentence units each consisting of one or more sentences;
a step of extracting, for each sorted sentence unit, words that appear in the sentence unit or words that reference a preceding sentence unit in the document data;
a step of identifying and storing, for each word extracted for the sentence units, the features of the word in each sentence unit;
a step of identifying, for each sorted sentence unit, a feature pattern including a pattern of the combination of the features when a word extracted for the sentence unit appears in the sentence unit and in preceding sentence units, or a pattern of reference when the word is referenced from a preceding sentence unit;
a step of storing the identified feature pattern and whether the word identified by the feature pattern appeared or was referenced in a subsequent sentence unit;
a step of executing regression learning in which, over all sentence units in the documents obtained from the document set, a regression analysis of the reference probability that a word identified by one feature pattern appears or is referenced in a subsequent sentence unit is performed to obtain regression coefficients corresponding to the feature patterns;
for each sorted sentence unit,
a step of calculating, for each word extracted from the preceding sentence units up to that sentence unit in the document data, the reference probability of the word using the regression coefficient corresponding to the feature pattern identified in that sentence unit, and
a step of storing in advance, in association with the sentence unit, a weighted word group to which the calculated reference probabilities are respectively assigned;
a step of, when words are accepted, storing the words in the order in which they are accepted;
when words are accepted,
a step of extracting words that appear in the accepted words or words that reference previously accepted words,
a step of identifying the features of each extracted word in the accepted words,
a step of identifying a feature pattern including a pattern of the combination of the features when the word appears in previously accepted words, or a pattern of reference when the word is referenced from previously accepted words,
a step of calculating the reference probability of the word using the regression coefficient corresponding to the identified feature pattern, and
a step of associating with the accepted words a weighted word group to which the calculated reference probabilities are respectively assigned;
a step of calculating the difference between the reference probabilities assigned to each identical word in the weighted word groups associated with the accepted words and with the previously sorted sentence units;
a step of assigning priorities to the previously sorted sentence units in ascending order of the difference in reference probability; and
a step of outputting the sentence units on the basis of the assigned priorities.
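The retrieval side of claim 16 can be compressed into one sketch: both the accepted words and the stored sentence units carry word groups weighted by reference probability, and units are ranked by the summed per-word probability difference. Treating a word missing from one side as probability 0 is an assumption of this sketch:

```python
# Hypothetical sketch of the ranking step of claim 16.
def rank_units(query_probs: dict, stored_units: list) -> list:
    def diff(unit_probs: dict) -> float:
        words = set(query_probs) | set(unit_probs)
        return sum(abs(query_probs.get(w, 0.0) - unit_probs.get(w, 0.0))
                   for w in words)
    return sorted(stored_units, key=lambda u: diff(u[1]))  # ascending difference

stored = [("unit A", {"economy": 0.7, "president": 0.3}),
          ("unit B", {"sports": 0.8})]
print(rank_units({"economy": 0.6, "president": 0.4}, stored)[0][0])  # "unit A"
```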
[17] A sentence unit search device comprising means for acquiring document data from a document set in which a plurality of document data composed of natural language are stored, and means for accepting words, the device searching the document set on the basis of the accepted words and comprising:
means for sorting the acquired document data into sentence units each consisting of one or more sentences;
means for storing, in association with each of the successive sentence units in the acquired document data, a weighted word group consisting of the plurality of words each assigned a weight value for that sentence unit;
means for storing accepted words in the order in which they are accepted;
means for associating, each time new words are accepted, a weighted word group consisting of the plurality of words each assigned a weight value for those words;
means for extracting, from the previously sorted sentence units, sentence units recorded in association with a weighted word group similar to the weighted word group associated with the accepted words; and
means for outputting the extracted sentence units.
[18] A computer program capable of causing a computer, which is able to acquire document data from a document set in which a plurality of document data composed of natural language are stored, to function as means for accepting words and means for searching the document set on the basis of the accepted words, the program causing the computer to function as:
means for sorting the acquired document data into sentence units each consisting of one or more sentences;
means for storing, in association with each of the successive sentence units in the acquired document data, a weighted word group consisting of the plurality of words each assigned a weight value for that sentence unit;
means for storing accepted words in the order in which they are accepted;
means for associating, each time new words are accepted, a weighted word group consisting of the plurality of words each assigned a weight value for those words; and
means for extracting, from the previously sorted sentence units, sentence units recorded in association with a weighted word group similar to the weighted word group associated with the accepted words.
[19] A computer-readable recording medium on which the computer program according to claim 18 is recorded.
[20] A document storage device comprising means for storing a plurality of document data composed of natural language and means for sorting the stored document data, in order from the beginning of the document data, into sentence units each consisting of one or more sentences, wherein, for each sorted sentence unit, words that appear in the sentence unit or words that are referenced from preceding sentence units are extracted, and the extracted words are stored for each sorted sentence unit, the device comprising:
means for storing, in association with each of the successive sentence units in the document data, a weighted word group consisting of the plurality of words each assigned a weight value for that sentence unit.
[21] The document storage device according to claim 20, comprising:
extraction means for extracting, for one word among the extracted words, from the weighted word groups associated with the respective sentence units, word groups that contain the one word and in which the weight value of the one word is equal to or greater than a predetermined value;
creation means for creating a related word group in which a value obtained by integrating, word by word, the weight values of the words in the word groups extracted by the extraction means is assigned as the degree of relevance of the one word to each word; and
storage means for storing the created related word group in association with the one word,
wherein the processes of the extraction means, the creation means, and the storage means are executed for each of the extracted words, and the respective related word groups are stored in association with the words.
PCT/JP2007/055448 2006-08-21 2007-03-16 Sentence search method, sentence search engine, computer program, recording medium, and document storage WO2008023470A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2008530812A JP5167546B2 (en) 2006-08-21 2007-03-16 Sentence search method, sentence search device, computer program, recording medium, and document storage device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2006-224563 2006-08-21
JP2006224563 2006-08-21

Publications (1)

Publication Number Publication Date
WO2008023470A1 true WO2008023470A1 (en) 2008-02-28

Family

ID=39106564

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2007/055448 WO2008023470A1 (en) 2006-08-21 2007-03-16 Sentence search method, sentence search engine, computer program, recording medium, and document storage

Country Status (2)

Country Link
JP (1) JP5167546B2 (en)
WO (1) WO2008023470A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287291B (en) * 2019-07-03 2021-11-02 桂林电子科技大学 Unsupervised method for analyzing running questions of English short sentences

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06162092A (en) * 1992-11-18 1994-06-10 Fujitsu Ltd Information retrieval device
JP2004234175A (en) * 2003-01-29 2004-08-19 Matsushita Electric Ind Co Ltd Contents retrieval device and program therefor
JP2005250762A (en) * 2004-03-03 2005-09-15 Mitsubishi Electric Corp Dictionary generation device, dictionary generation method and dictionary generation program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Tokunaga, T.: "Gengo to Keisan 5: Joho Kensaku to Gengo Shori", 1st ed., University of Tokyo Press, 1999, XP003021201 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009282936A (en) * 2008-05-26 2009-12-03 Nippon Telegr & Teleph Corp <Ntt> Selection type information presentation device and selection type information presentation processing program
JP2015506509A (en) * 2011-12-28 2015-03-02 ▲騰▼▲訊▼科技(深▲セン▼)有限公司 Method and system for generating evaluation information and computer storage medium
JP2013140500A (en) * 2012-01-05 2013-07-18 Nippon Telegr & Teleph Corp <Ntt> Word extraction device, method, and program
JP2013140499A (en) * 2012-01-05 2013-07-18 Nippon Telegr & Teleph Corp <Ntt> Method, apparatus, and program for extracting word
US10614065B2 (en) 2016-10-26 2020-04-07 Toyota Mapmaster Incorporated Controlling search execution time for voice input facility searching
CN108710613A (en) * 2018-05-22 2018-10-26 平安科技(深圳)有限公司 Acquisition methods, terminal device and the medium of text similarity
JP2020042771A (en) * 2018-09-07 2020-03-19 台達電子工業股▲ふん▼有限公司Delta Electronics,Inc. Data analysis method and data analysis system
US11409804B2 (en) 2018-09-07 2022-08-09 Delta Electronics, Inc. Data analysis method and data analysis system thereof for searching learning sections
JP2020057105A (en) * 2018-09-28 2020-04-09 株式会社リコー Language processing method, language processing program and language processing device
JP7147439B2 (en) 2018-09-28 2022-10-05 株式会社リコー Language processing method, language processing program and language processing device
US11397776B2 (en) 2019-01-31 2022-07-26 At&T Intellectual Property I, L.P. Systems and methods for automated information retrieval
JP7055764B2 (en) 2019-03-13 2022-04-18 株式会社東芝 Dialogue control system, dialogue control method and program
JP2020149369A (en) * 2019-03-13 2020-09-17 株式会社東芝 Dialog control system, dialog control method, and program
CN110083681A (en) * 2019-04-12 2019-08-02 中国平安财产保险股份有限公司 Searching method, device and terminal based on data analysis
CN110083681B (en) * 2019-04-12 2024-02-09 中国平安财产保险股份有限公司 Searching method, device and terminal based on data analysis
CN111753498A (en) * 2020-08-10 2020-10-09 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN111753498B (en) * 2020-08-10 2024-01-26 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN112784577A (en) * 2021-01-26 2021-05-11 鲁巧巧 Sentence association learning system for English teaching
CN112784577B (en) * 2021-01-26 2022-11-18 鲁巧巧 Sentence association learning system for English teaching
CN113761157A (en) * 2021-05-28 2021-12-07 腾讯科技(深圳)有限公司 Response statement generation method and device

Also Published As

Publication number Publication date
JP5167546B2 (en) 2013-03-21
JPWO2008023470A1 (en) 2010-01-07

Similar Documents

Publication Publication Date Title
JP5167546B2 (en) Sentence search method, sentence search device, computer program, recording medium, and document storage device
US9330661B2 (en) Accuracy improvement of spoken queries transcription using co-occurrence information
KR101279707B1 (en) Definition extraction
US20040148170A1 (en) Statistical classifiers for spoken language understanding and command/control scenarios
US20040073874A1 (en) Device for retrieving data from a knowledge-based text
US20040148154A1 (en) System for using statistical classifiers for spoken language understanding
US20070198511A1 (en) Method, medium, and system retrieving a media file based on extracted partial keyword
EP2348427B1 (en) Speech retrieval apparatus and speech retrieval method
Favre et al. Robust named entity extraction from large spoken archives
AU2006317628A1 (en) Word recognition using ontologies
WO1998044484A1 (en) Text normalization using a context-free grammar
JP2004133880A (en) Method for constructing dynamic vocabulary for speech recognizer used in database for indexed document
EP1331574B1 (en) Named entity interface for multiple client application programs
Sen et al. Bangla natural language processing: A comprehensive analysis of classical, machine learning, and deep learning-based methods
US20230069935A1 (en) Dialog system answering method based on sentence paraphrase recognition
Kaushik et al. Automatic audio sentiment extraction using keyword spotting.
CN115759071A (en) Government affair sensitive information identification system and method based on big data
Dyriv et al. The user's psychological state identification based on Big Data analysis for person's electronic diary
Lin et al. Enhanced BERT-based ranking models for spoken document retrieval
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature
Rosset et al. The LIMSI participation in the QAst track
Masumura et al. Training a language model using webdata for large vocabulary Japanese spontaneous speech recognition
Sen et al. Bangla natural language processing: A comprehensive review of classical machine learning and deep learning based methods
Sen et al. Audio indexing
JP2008204133A (en) Answer search apparatus and computer program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 07738893; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2008530812; Country of ref document: JP)
NENP Non-entry into the national phase (Ref country code: DE)
NENP Non-entry into the national phase (Ref country code: RU)
122 Ep: pct application non-entry in european phase (Ref document number: 07738893; Country of ref document: EP; Kind code of ref document: A1)