WO2008023470A1 - Sentence search method, sentence search device, computer program, recording medium, and document storage device - Google Patents

Sentence search method, sentence search device, computer program, recording medium, and document storage device

Info

Publication number
WO2008023470A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
sentence
sentence unit
words
weighted
Prior art date
Application number
PCT/JP2007/055448
Other languages
English (en)
Japanese (ja)
Inventor
Shun Shiramatsu
Kazunori Komatani
Hiroshi Okuno
Original Assignee
Kyoto University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kyoto University filed Critical Kyoto University
Priority to JP2008530812A priority Critical patent/JP5167546B2/ja
Publication of WO2008023470A1 publication Critical patent/WO2008023470A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis

Definitions

  • Sentence search method, sentence search device, computer program, recording medium, and document storage device
  • The present invention relates to a search method for searching a large set of stored document data, based on words such as text or speech received from a user.
  • More specifically, it relates to directly retrieving, from the sentence units that form groups of meaning in a document and whose meaning changes dynamically with the flow of context, those sentence units whose meaning is similar to the received words.
  • The present invention relates to such a sentence unit retrieval method, a sentence unit retrieval apparatus, a computer program that causes a computer to function as the sentence unit retrieval apparatus, a computer-readable recording medium on which the computer program is recorded, and a document storage apparatus.
  • Conventional document search services include the following. Documents published on the Internet are automatically collected and stored, and for each document the words appearing in it are stored together with their appearance probabilities in the document. When words such as keywords or a sentence are received, documents are extracted from the stored document set with priorities assigned in descending order of the appearance probabilities of the words included in the received keywords or sentence, and the sentences or paragraphs containing those words are output from the extracted documents.
  • A user of such a document search service needs to think of keywords related to the information he or she wants to find.
  • Suppose, for example, that the user wants to know about economic policies and international policies. Even if the user's input is in natural language, a human reader can grasp which of the words "America, President, other countries, economy, problems, outbreaks, countermeasures" is most important, but it is difficult to express this quantitatively as information handled by a device or computer. Therefore, although all the keywords are included, a document describing "economic problems of America and countermeasures by presidents of other countries" may be output even though it is not what the user intended.
  • There are also cases where a keyword entered for the search is included in the document to be searched but, although it does not appear frequently, has an important meaning in context.
  • In such cases the topical word is expressed by a demonstrative pronoun or a zero pronoun. Therefore, a sentence or paragraph in which the keyword input for the search is expressed by a demonstrative pronoun or zero pronoun may well be exactly the information the user wants to obtain as a search result.
  • However, when search results are prioritized by actual appearance frequency, the appearance frequency of the keyword input by the user is low in such passages, so they are excluded from the candidates when narrowing down and are not output as search results.
  • The words in the document are extracted, and the document is subjected to morphological analysis using the part-of-speech information of each word, the dependency information between words, and information specifying anaphoric relationships with demonstrative pronouns or zero pronouns.
  • Techniques have been proposed in which, based on the stored information, a device or computer performs document retrieval, question answering, and machine translation (see Non-Patent Document 1).
  • Non-Patent Document 1: Hiroshi Hashida, "Global Document Modification", Proceedings of the 11th Annual Conference of the Japanese Society for Artificial Intelligence, pp. 62-63 (1997)
  • However, the object of the user's attention (the salient object) in each sentence or utterance changes dynamically according to the context, that is, the flow of the text.
  • In other words, the weight representing the degree of attention to words in conversations and sentences changes dynamically. Therefore, in order to realize a service that retrieves information related to conversations and sentences, it is necessary to track the dynamic changes in word weight according to the context.
  • The technique of Non-Patent Document 1 automatically analyzes information that can be identified from grammatical context, such as part-of-speech information, and can add to the document information on anaphora, coreference, and dependency for demonstrative pronouns or zero pronouns. With the added information, the noun being referred to can be counted in the appearance frequency, so the relationships between words in sentences or paragraphs can be analyzed. However, the degree of attention in each sentence or paragraph, that is, the salience, cannot be measured quantitatively.
  • The technique of Non-Patent Document 1 can be applied to realizing question answering in which a computer responds to a question posed in a natural sentence, taking into account words omitted from the question sentence.
  • However, it is not easy to calculate the contextual meaning of a conversation among multiple users as a quantitative value, nor to generate and present, as a third party, utterances that match the users' conversational context.
  • The present invention has been made in view of such circumstances. For each sentence unit consisting of one or a plurality of sentences, a weighted word group in which weight values indicating the salience of each word in that sentence unit are assigned is stored in association with the sentence unit. Words received for search are likewise associated with a weighted word group in which weight values in those words are assigned, and sentence units whose weighted word groups are similar are extracted and output.
  • From the received words, information reflecting the context of the preceding words in the user's consciousness is generated automatically, so that sentence units in a document, whose meaning changes dynamically with the flow of context, can be searched directly.
  • It is an object of the present invention to provide a sentence unit search method, a sentence unit search apparatus, and a computer program that causes a computer to function as the search apparatus, capable of directly retrieving sentence units whose contextual meaning is similar to that represented by the information generated from the received words, as well as a computer-readable recording medium on which the computer program is recorded.
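As a rough illustration of the idea described above, the following sketch (not from the patent; the weights here are simple relative frequencies rather than the reference probabilities the invention actually assigns, and the sentence units are invented examples) associates each sentence unit with a weighted word group and extracts the unit whose weight distribution is most similar to that of the received words:

```python
# Hypothetical sketch: associate each sentence unit with a weighted word
# group and retrieve the unit most similar to the query's weighted group.
from collections import Counter

def weighted_word_group(words):
    """Map each word to a weight value in [0, 1] (relative frequency)."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def similarity(group_a, group_b):
    """Compare two weight-value distributions: sum of min shared weights."""
    shared = set(group_a) & set(group_b)
    return sum(min(group_a[w], group_b[w]) for w in shared)

# Sentence units (each a list of words) stored with their weighted groups.
units = [["economy", "policy", "president"],
         ["soccer", "match", "goal"]]
stored = [(u, weighted_word_group(u)) for u in units]

# Received words are given a weighted word group the same way, and the
# stored unit with the most similar distribution is extracted.
query_group = weighted_word_group(["president", "economy", "problem"])
best_unit, _ = max(stored, key=lambda s: similarity(s[1], query_group))
```

The real method replaces the frequency weights with context-dependent salience values, which is what lets it match passages where a topic word is barely repeated.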
  • Another object of the present invention is to calculate, as the weight value indicating the salience of each word in the weighted word group associated with a sentence unit or with the received words, the probability that the word appears or is referred to in subsequent sentence units or words.
  • Another object of the present invention is to quantitatively calculate the degree of relevance between related words and reflect it in the salience of each word in each sentence unit or word, so that sentence units containing words that the user is reminded of when uttering or writing can be retrieved even if those words do not appear in the words the user actually spoke or wrote.
  • A further object of the present invention is to provide a sentence unit retrieval method and a document storage device capable of such retrieval. Means for Solving the Problem
  • The sentence unit retrieval method according to the present invention uses a document set in which a plurality of document data composed of natural language is stored, separates the document data obtained from the document set into sentence units each consisting of one or more sentences, receives words, and retrieves sentence units separated from the document set based on the received words.
  • The similar sentence unit extraction step includes a step of determining whether or not the distribution of the weight values of the plurality of words in the weighted word group associated with each sentence unit sorted in advance satisfies a predetermined condition with respect to the distribution of the weight values of the plurality of words in the weighted word group associated with the received words, and a step of extracting the sentence units associated with the weighted word groups determined to satisfy the predetermined condition.
  • The similar sentence unit extraction step includes a step of extracting, from the sentence units sorted in advance, sentence units whose weighted word groups include the same words as the weighted word group associated with the received words, a step of calculating the difference between the weight values assigned to those same words, and a step of assigning priorities to the extracted sentence units in ascending order of the calculated difference; the extracted sentence units are output based on the priorities.
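The extraction-and-ranking step just described can be sketched as follows; the unit names and weight values are illustrative assumptions, not data from the patent:

```python
# Hedged sketch: among stored sentence units sharing words with the
# query's weighted word group, rank in ascending order of the summed
# difference between the weight values of the shared words.

def rank_by_weight_difference(query, stored_units):
    candidates = []
    for unit_id, group in stored_units.items():
        shared = set(group) & set(query)
        if not shared:
            continue  # only units containing the same words are extracted
        diff = sum(abs(group[w] - query[w]) for w in shared)
        candidates.append((diff, unit_id))
    # ascending difference -> highest priority first
    return [unit_id for diff, unit_id in sorted(candidates)]

query = {"president": 0.6, "economy": 0.4}
stored_units = {
    "u1": {"president": 0.5, "economy": 0.5},   # diff 0.1 + 0.1 = 0.2
    "u2": {"president": 0.1, "economy": 0.9},   # diff 0.5 + 0.5 = 1.0
    "u3": {"soccer": 1.0},                      # no shared word, skipped
}
ranking = rank_by_weight_difference(query, stored_units)  # ['u1', 'u2']
```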
  • The weighted word group is calculated as a multidimensional vector in which each word constitutes one dimension and the magnitude of the weight value assigned to each word is the element in the dimension corresponding to that word. The step of extracting similar sentence units includes a step of calculating the distance between the multidimensional vector stored for each separated sentence unit and the multidimensional vector associated with the received words, and a step of assigning priorities to the sentence units in ascending order of the calculated distance; the sentence units are output according to the assigned priorities.
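The multidimensional-vector form of the weighted word group can be sketched as below; the vocabulary and weight values are illustrative assumptions, and Euclidean distance is used as one possible distance measure:

```python
# Sketch: each word is one dimension, its weight value the element in
# that dimension; similar sentence units are found and prioritized by
# ascending distance between vectors.
import math

VOCAB = ["president", "economy", "policy", "soccer"]

def to_vector(group):
    """Weighted word group -> vector over a fixed word vocabulary."""
    return [group.get(w, 0.0) for w in VOCAB]

def distance(vec_a, vec_b):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))

stored = {"u1": to_vector({"president": 0.6, "economy": 0.4}),
          "u2": to_vector({"soccer": 0.9})}
query = to_vector({"president": 0.5, "economy": 0.5})

# Priorities in ascending order of distance (most similar unit first).
ranking = sorted(stored, key=lambda uid: distance(stored[uid], query))
```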
  • The method includes a reference probability calculation step of calculating, for each word, the reference probability that the word appears or is referred to in sentence units or words subsequent to the sentence unit or word, and the calculated reference probability is assigned as the weight value of each word.
  • The reference probability calculating step includes: a specifying step of specifying, for each word, a feature pattern including the pattern with which the word appears in a plurality of sentence units including the preceding sentence units, or the pattern with which it is referred to from preceding sentence units or words; a determination step of determining whether a word for which the same feature pattern as the specified feature pattern was specified appeared or was referred to in subsequent sentence units in the document data; and a regression step of calculating regression coefficients of the feature patterns with respect to the reference probability by performing a regression analysis on the specified feature patterns and the determination results for the words specified by those feature patterns.
  • When a weighted word group is stored in association with each sentence unit or associated with the received words, the reference probability calculating step specifies the feature pattern of each word in that sentence unit or those words and calculates the reference probability using the specified feature pattern and the regression coefficients.
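A minimal sketch of how learned regression coefficients could map a word's feature pattern to a reference probability is given below. The feature names and coefficient values are assumptions for illustration only; the patent obtains the coefficients by regression analysis over a document set, and a logistic model is used here as one plausible form:

```python
# Hedged sketch: features of a word's appearance pattern are mapped to a
# reference probability via a logistic regression model.
import math

# Regression coefficients (bias first), assumed already learned.
COEF = {"bias": -1.0,
        "distance_to_last_mention": -0.8,   # farther back -> less salient
        "was_subject": 1.5,                 # subjects stay salient longer
        "mention_count": 0.6}

def reference_probability(features):
    """Logistic model: P(word appears/is referred to in later units)."""
    z = COEF["bias"]
    for name, value in features.items():
        z += COEF[name] * value
    return 1.0 / (1.0 + math.exp(-z))

p = reference_probability({"distance_to_last_mention": 1,
                           "was_subject": 1,
                           "mention_count": 2})
# p is around 0.71: a recently mentioned subject is likely to recur.
```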
  • The sentence unit search method performs the specifying step, the determination step, and the regression step on each of a first document set composed of written language and a second document set composed of spoken language.
  • The reference probability calculation step calculates, for the feature pattern of a word specified in a sentence unit, the reference probability using the regression coefficients obtained by the regression step executed on the first document set, and, for the feature pattern of a word specified in the received words, the reference probability using the regression coefficients obtained by the regression step executed on the second document set.
  • The feature pattern is specified by information including one or more of: the number of sentence units or words from the preceding sentence unit or word in which the word last appeared or was referred to, up to the sentence unit or words containing the word; the dependency information of the word in the last preceding sentence unit or word in which it appeared or was referred to; the number of times the word has appeared or been referred to; the noun classification of the word in the last preceding sentence unit or word in which it appeared or was referred to; whether the word was the subject in the last preceding sentence unit or word in which it appeared or was referred to; whether the word is the subject in the sentence unit or words containing it; the person of the word in the sentence unit or words containing it; and the part-of-speech information of the word in the sentence unit or words containing it.
  • The feature pattern may also be specified by information including one or more of: the time elapsed from the preceding sentence unit or word in which the word was last referred to, up to the sentence unit or words containing the word; the utterance speed corresponding to the word in the last preceding sentence unit or word in which it appeared or was referred to; and the voice frequency corresponding to the word in the last preceding sentence unit or word in which it appeared or was referred to.
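One way such features could be extracted from the history of sentence units is sketched below; the attribute names (`units_back`, `mention_count`, `was_subject`) are hypothetical labels, not the patent's exact feature set:

```python
# Illustrative encoding of a word's feature pattern from the history of
# sentence units: how far back it last appeared, how often it has
# appeared, and whether it was the subject at its last appearance.

def feature_pattern(word, history):
    """history: list of sentence units, each a list of (word, is_subject)."""
    last_seen, count, was_subject = None, 0, 0
    for i, unit in enumerate(history):
        for w, subj in unit:
            if w == word:
                last_seen, count, was_subject = i, count + 1, int(subj)
    units_back = len(history) - 1 - last_seen if last_seen is not None else -1
    return {"units_back": units_back,
            "mention_count": count,
            "was_subject": was_subject}

history = [[("president", True), ("economy", False)],
           [("policy", False)],
           [("president", False)]]
fp = feature_pattern("president", history)
```

Prosodic features such as elapsed time, utterance speed, or pitch would be added analogously when the input is speech.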
  • The sentence unit search method further includes: a first step of extracting, for one word among the words extracted from the document set, the weighted word groups associated with the sorted sentence units that include the one word and in which the weight value of the one word is equal to or greater than a predetermined value; a second step of creating a related word group in which the value obtained by integrating, for each word, the weight values of the words in the word groups extracted in the first step is assigned as the degree of relevance of each word to the one word; and a third step of storing the created related word group in association with the one word. The first to third steps are executed for each of the extracted words.
  • The method further includes a relevance addition step of reassigning the weight value of each word of the weighted word group associated with each sentence unit or with the received words, using the degree of relevance of each word of the related word group stored in association with that word.
  • The second step includes a step of calculating, for each word, the sum of the weight values of that word in the extracted word groups, weighted by the weight value of the one word; a step of averaging the calculated sums; and a step of assigning the averaged sum of the weight values of each word as the degree of relevance of each word in the related word group to be created.
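The related-word-group construction just described can be sketched as follows; the threshold value and the weighted word groups are illustrative assumptions:

```python
# Hedged sketch of building a related word group for one word: collect
# the weighted word groups in which the word's weight is at least a
# threshold, sum each co-occurring word's weight scaled by the one
# word's weight, then average over the collected groups.

def related_word_group(target, groups, threshold=0.3):
    selected = [g for g in groups if g.get(target, 0.0) >= threshold]
    totals = {}
    for g in selected:
        scale = g[target]  # weight of the one word in this group
        for w, v in g.items():
            totals[w] = totals.get(w, 0.0) + scale * v
    n = len(selected)
    return {w: s / n for w, s in totals.items()} if n else {}

groups = [{"president": 0.5, "economy": 0.4},
          {"president": 0.4, "policy": 0.6},
          {"president": 0.1, "soccer": 0.9}]   # below threshold, skipped
rel = related_word_group("president", groups)
```

Words that co-occur only in groups where the target word is barely salient (here "soccer") are excluded, so the relevance values reflect co-salience rather than mere co-occurrence.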
  • The relevance addition step includes multiplying the weight value of each word of the weighted word group associated with each sentence unit or with the received words by the degree of relevance of the corresponding word in the related word group stored in association with each word, and reassigning the multiplication results as the weight values of the words of the weighted word group.
  • The sentence unit search method calculates the related word group of each word as a multidimensional relevance vector in which each word constitutes one dimension and the degree of relevance assigned to each word is the element in the dimension corresponding to that word. The relevance addition step converts the multidimensional vector stored for each sorted sentence unit by a matrix whose columns are the relevance vectors of the respective words.
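The matrix form of the relevance-addition step can be sketched as below; the vocabulary and matrix values are illustrative assumptions:

```python
# Sketch: the vector of a sentence unit is multiplied by a matrix whose
# columns are each word's relevance vector, tilting the word axes so
# that related words pull each other's weights up.

VOCAB = ["president", "economy", "policy"]

# Column j is the relevance vector of VOCAB[j] (diagonal = 1.0: a word
# is fully related to itself).
REL = [[1.0, 0.5, 0.4],
       [0.5, 1.0, 0.3],
       [0.4, 0.3, 1.0]]

def apply_relevance(vec):
    """Return REL @ vec: each word's weight augmented by related words."""
    return [sum(REL[i][j] * vec[j] for j in range(len(vec)))
            for i in range(len(REL))]

unit_vec = [0.8, 0.0, 0.2]          # "president" salient, no "economy"
converted = apply_relevance(unit_vec)
```

After conversion, "economy" receives a nonzero weight purely through its relevance to "president", which is how sentence units can be retrieved for words the user never uttered.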
  • The sentence unit search method according to the present invention uses a document set in which a plurality of document data consisting of natural language is stored, receives words, and searches the document set based on the received words. It includes: a step of separating the document data obtained from the document set into sentence units each consisting of one or more sentences; a step of extracting, for each sentence unit, the words that appear in the sentence unit or the words that are referred to from preceding sentence units in the document data; a step of specifying and storing, for each word extracted for a sentence unit, its features in each sentence unit; a step of specifying, for each word extracted for a sentence unit, a feature pattern including the pattern of the combination of features with which the word appears in the sentence unit and the preceding sentence units, or the pattern with which it is referred to from preceding sentence units; and a step of storing the specified feature patterns together with whether or not the words specified by those feature patterns appeared or were referred to in subsequent sentence units.
  • It further includes: a step of executing regression learning over all the sentence units in the documents obtained from the document set, performing a regression analysis of the reference probability that a word specified by one feature pattern appears or is referred to in subsequent sentence units, to obtain the regression coefficients corresponding to the feature patterns; a step of calculating, for each sentence unit, the reference probability of each word extracted up to that sentence unit in the document data, using the regression coefficients corresponding to the feature pattern specified for the word in that sentence unit; and a step of storing in advance, for each sentence unit, a weighted word group to which the calculated reference probabilities are assigned.
  • It further includes: a step of storing received words in the order of reception; a step of extracting, when words are received, the words that appear in the received words or the words that are referred to from words received earlier; a step of specifying the features of each word in the received words; a step of specifying a feature pattern including the pattern of the combination of features with which the word appeared in previously received words, or the pattern with which it is referred to from previously received words; a step of calculating the reference probability of the word using the regression coefficients corresponding to the specified feature pattern; and a step of associating with the received words a weighted word group to which the calculated reference probabilities are assigned.
  • Finally, it includes: a step of calculating, for each of the same words in the weighted word group associated with the received words and in the weighted word groups of the sentence units sorted in advance, the difference between the assigned reference probabilities; a step of assigning priorities to the sorted sentence units in ascending order of the difference of the reference probabilities; and a step of outputting the sentence units based on the assigned priorities.
  • The sentence unit search device according to the present invention comprises: means for acquiring document data from a document set in which a plurality of document data consisting of natural language is stored; means for receiving words; means for separating the acquired document data into sentence units each consisting of one or more sentences; means for storing, in association with each sentence unit in the acquired document data, a weighted word group composed of a plurality of words to which weight values in that sentence unit are assigned; means for storing received words in the order received; means for associating with newly received words a weighted word group composed of a plurality of words to which weight values in those words are assigned; means for extracting, from the sentence units sorted in advance, sentence units recorded in association with weighted word groups similar to the weighted word group associated with the received words; and means for outputting the extracted sentence units.
  • The computer program according to the present invention causes a computer capable of acquiring document data from a document set, in which a plurality of document data composed of natural language is stored, to function as: means for receiving words; means for searching the document set based on the received words; means for separating the acquired document data into sentence units each consisting of one or more sentences; means for storing, in association with each sentence unit in the acquired document data, a weighted word group composed of a plurality of words to which weight values are assigned; means for storing received words in the order received; means for associating with newly received words a weighted word group composed of a plurality of words to which weight values in those words are assigned; and means for extracting sentence units recorded in association with weighted word groups similar to the weighted word group associated with the received words.
  • The computer-readable recording medium according to the nineteenth aspect of the invention is characterized in that the computer program according to the eighteenth aspect of the invention is recorded thereon.
  • The document storage device according to the present invention comprises: means for storing a plurality of document data composed of natural language; means for separating the stored document data, in order from the beginning of the document data, into sentence units each composed of one or a plurality of sentences; means for extracting, for each sentence unit, the words that appear in the sentence unit or the words that are referred to from preceding sentence units, and storing the extracted words for each sentence unit; and means for storing, in association with each sentence unit in the document data, a weighted word group composed of a plurality of words to which weight values in that sentence unit are assigned.
  • The document storage device further comprises: extraction means for extracting, from the weighted word groups associated with each sentence unit, those that include one word; creation means for creating a related word group in which the degree of relevance of each word to the one word is assigned; and storage means for storing the created related word group in association with the one word. The processing of the extraction means, the creation means, and the storage means is executed for each extracted word, and each related word group is stored in association with each word.
  • In the present invention, document data is acquired from a document set in which document data composed of natural language is recorded, and the acquired document data is separated into sentence units each consisting of one or more sentences.
  • For each sentence unit, each word appearing in the document set is assigned a weight value in that sentence unit, and the weighted word group of words to which the weight values are assigned is stored in association with each sentence unit.
  • A weighted word group of words to which weight values in the received words are assigned is likewise associated with the received words.
  • Sentence units associated with weighted word groups similar to the weighted word group associated with the received words are extracted from the sentence units sorted in advance and output.
  • In the first invention, when extracting the sentence units associated with similar weighted word groups, it is determined whether the distribution of the weight values of the plurality of words in the weighted word group stored in advance in association with each sentence unit satisfies a predetermined condition with respect to the distribution of the weight values of the plurality of words in the weighted word group associated with the received words, that is, whether they are similar, and the sentence units associated with the weighted word groups determined to be similar are extracted.
  • In the first invention, the weighted word group is obtained as a multidimensional vector having each word as one dimension and the weight value assigned to each word as the element of the corresponding dimension. Whether weighted word groups are similar is determined by whether the distance between them, that is, the distance between the multidimensional vectors, is short. The extracted sentence units are output in ascending order of the distance between the multidimensional vectors, that is, in order of similarity of the weighted word groups.
  • The reference probability calculated in the fifth invention is calculated as the rate at which words having the same feature pattern as the feature pattern specified for each word, including the pattern of appearance up to each sentence unit or the pattern of reference from preceding sentence units, appear or are referred to in subsequent sentence units in the document set.
  • For each word extracted from the document set, a regression analysis is performed on the specified feature patterns and the determination results of whether the words for which those feature patterns were specified appeared or were referred to in subsequent sentence units in the documents of the document set, and the regression coefficients of the feature patterns with respect to the reference probability are calculated.
  • The reference probabilities calculated in the fifth invention are thus calculated from the feature pattern specified for each word and the regression coefficients.
  • The document set is divided into a first document set composed of written language and a second document set composed of spoken language.
  • The reference probability assigned to each word in the weighted word group associated with a sentence unit is calculated based on the first document set, and the reference probability assigned to each word in the weighted word group associated with the received words is calculated based on the second document set.
  • In specifying the feature pattern of each word, information such as the number of sentence units or words from the preceding sentence unit or word in which the word appeared or was referred to up to the current sentence unit or word, the dependency information of the word, the number of its appearances or references, its noun classification, whether the word was or is the subject, its person, and its part-of-speech information is quantified and handled.
  • When the reference probability is calculated in the sixth to tenth inventions, information such as the time elapsed from the preceding sentence unit or word in which the word appeared or was referred to, the utterance speed corresponding to the word when it appeared or was referred to, and the pitch (frequency) of the voice is also handled quantitatively as features for specifying the feature pattern of each word.
  • For each word extracted from the document set, the weighted word groups in which the weight value of the word is equal to or greater than a predetermined value are extracted.
  • One related word group is created by integrating, for each word, the weight values of the words in the plurality of weighted word groups extracted for the one word. The degree of relevance of each word in the created related word group represents the depth of its relation to the one word when a weight value equal to or greater than the predetermined value is assigned to the one word.
  • A related word group is generated and stored for each word extracted from the document set. The weight value of each word of the weighted word group associated with each sentence unit or with the received words is reassigned using the degree of relevance of each word of the related word group associated with that word.
  • For the word groups extracted as weighted word groups in which the weight value of the one word is equal to or greater than a predetermined value, the sum of the weight values of each word, weighted by the weight value of the one word, is calculated. The sums are averaged, and the averaged sum of the weight values of each word is assigned as the degree of relevance of each word in the related word group.
  • In the twelfth or thirteenth invention, the degree of relevance of each word of the stored related word group is multiplied by the weight value of each word of the weighted word group associated with each sentence unit or with the received words, and the multiplication result is reassigned as the weight value of each word in the weighted word group. Focusing on one word in the weighted word group, the degree of relevance of each word in the related word group corresponding to that one word is used. By multiplying the weight value of each word other than the one word by the degree of relevance of the corresponding word in the related word group associated with the one word, the influence of the weight value of the one word on the weight values of highly related words is taken into account.
  • The related word group is obtained as a multidimensional relevance vector in which each word constitutes one dimension and the degree of relevance assigned to each word is the element of the corresponding dimension.
  • The multidimensional vector associated with each sentence unit or word is converted by a matrix whose columns are the relevance vectors of the respective words.
  • The converted multidimensional vector is represented in an oblique coordinate system in which the distance between the axes of words with high mutual relevance is short.
  • A multidimensional vector representing a weighted word group is thereby rotated toward the axes of words with which its words are highly related, so that the distance between multidimensional vectors containing highly related words becomes shorter.
  • In the present invention, for each sentence unit, the words that appear in the sentence unit or are referred to from preceding sentence units are extracted.
  • For each extracted word, its features are specified, and a feature pattern including the pattern of the combination of features leading up to each sentence unit, or the pattern of reference from preceding sentence units, is specified.
  • The reference probability of each extracted word is calculated, and weighted word groups are stored in advance for each sentence unit.
  • For the received words, feature patterns based on the preceding words are likewise specified, the reference probability of each word is calculated, and a weighted word group is associated.
  • The sentence units stored in advance are output with priorities assigned in ascending order of the difference in reference probabilities for the same words as in the weighted word group of the received words.
  • a weighted word group to which a word weight is assigned in that sentence unit is stored in association with each other.
  • a weighted word group in which a weight value for each sentence unit of a plurality of words is assigned to each sentence unit having one or more sentence powers in the acquired document data.
  • the word group with weight values is a set of weight values of each word in each sentence unit, and can be estimated as information indicating a group of meanings in each sentence unit.
  • the weighted word group of each separated sentence unit is not merely a static group of meanings of the whole document;
  • it can be understood as a group of meanings that changes dynamically, in time series, with the flow of context following the preceding sentences in the document.
  • whether or not weighted word groups are similar is determined from the distribution of the weight values of the plurality of words in the weighted word group of the accepted words and in each weighted word group stored in advance.
  • when the distributions resemble each other, the stored weighted word group and the weighted word group of the accepted words can be said to be similar.
  • the predetermined condition under which weighted word groups can be determined to be similar is, in other words, a condition that the distributions of the weight values of the respective words are similar.
  • for example, when the ratio of the weight value of one word to the weight value of another word is close to the corresponding ratio in the other weighted word group, the weighted word groups can be determined to be similar to each other.
  • the predetermined condition can also be set as whether or not the weight value of each word is equal to or greater than a predetermined value.
  • alternatively, the difference between the weight values of the same word can be obtained, and similarity can be determined according to whether that difference is small.
  • the weighted word group can be expressed as a multidimensional vector having each word as one dimension, with the weight value of each word in the sentence unit or word as the element of the corresponding dimensional component.
  • a group of meanings for each sentence unit or word can thus be treated as a quantitative vector.
  • by treating the group of meanings of each sentence unit or word as a quantitative multidimensional vector, a computer capable of vector calculation can directly extract similar sentence units by calculating the distance between the vector associated with the accepted words and the vector associated with each stored sentence unit.
  • alternatively, a condition can be set according to which subspace of the multidimensional space the vectors of the accepted words or of the pre-sorted sentence units fall into, and similar sentence units can be extracted directly.
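A minimal sketch of the distance-based extraction: each weighted word group is mapped onto a fixed word list (an assumed vocabulary) and the stored sentence unit with the smallest Euclidean distance to the accepted words' vector is returned. The vocabulary, weights, and function names are illustrative assumptions, not taken from the patent.

```python
import math

# Treat each weighted word group as a multidimensional vector (one dimension
# per word) and extract the stored sentence unit whose vector is closest to
# the vector associated with the accepted words.
VOCAB = ["Kyoto", "festival", "summer", "Tokyo"]  # assumed word list

def to_vector(weights):
    # missing words get weight 0.0 in the corresponding dimension
    return [weights.get(w, 0.0) for w in VOCAB]

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_unit(accepted_weights, stored_units):
    q = to_vector(accepted_weights)
    return min(stored_units,
               key=lambda unit: distance(q, to_vector(unit[1])))

stored = [
    ("Summer festivals in Kyoto.", {"Kyoto": 0.6, "festival": 0.5, "summer": 0.4}),
    ("Shopping in Tokyo.", {"Tokyo": 0.9}),
]
best = nearest_unit({"Kyoto": 0.7, "summer": 0.5}, stored)
```

Here the sentence unit about summer festivals in Kyoto is extracted, since its vector lies closest to that of the accepted words.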
  • the document set is not limited to a set of document data consisting of so-called written language, and the separated sentence units are therefore not necessarily sentence units of written language.
  • document data here means data that has already been stored, as distinguished from words accepted in real time, and may be document data in which spoken dialogue has been transcribed in order.
  • the accepted words are not limited to words, sentences, and the like input for the purpose of search; they may be, for example, each utterance, including voice, during a dialogue between users.
  • sentence units are extracted based on weighted word groups in which weight values are assigned for each utterance, so that the search takes into account the meaning that changes dynamically and chronologically with each utterance during the conversation.
  • a group of meanings can thus be estimated for each utterance, and sentence units similar to the estimated group of meanings can be extracted and presented for each utterance.
  • the weight value of each word in the weighted word group is given as the probability that the word appears in, or is referred to from, subsequent sentence units or words, so that the weight value expresses how much attention each word receives.
  • the reference probability can thus be understood as expressing the degree to which each word in the sentence unit is attended to, that is, its manifestation (salience).
  • when the document set consists of written language, the reference probability is learned and calculated based on a document set of written language; when the accepted words are spoken words, it is learned and calculated based on a document set of spoken language. As a result, sentence units with more similar meanings can be output, reflecting the characteristics that differ between written and spoken language.
  • the degree of association from each word to the other words is quantitatively calculated and stored for each word.
  • the weight value of each word in the weighted word group is recalculated based on the weight values of the other words and the degree of association from each of those words to the word in question.
  • the weight value of one word can thus reflect the influence of the weight values of words that have a high degree of association with it; that is, when the weight value of a word closely associated with one word is high, it can be reproduced that the weight value of that one word is also high.
  • a related word group for one word is expressed as a relevance vector, and a weighted word group is expressed as a multidimensional vector.
  • the multidimensional vector is converted with a matrix whose columns are the relevance vectors of the respective words. This shortens the distance between multidimensional vectors representing weighted word groups that include words with a high degree of association.
  • the influence of the weight value of a word with a high degree of relevance to one word can thereby be reflected in the weight value of that one word. By reflecting the degree of relevance in the manifestation of each word in each sentence unit or word, the invention has the excellent effect that sentence units can be searched effectively even with respect to words that do not explicitly appear in the accepted words but are recognized by the user.
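The matrix conversion just described can be sketched as follows: a matrix whose columns are relevance vectors transforms the word-weight vectors, after which vectors that put weight on closely related words end up closer together than in the original orthogonal coordinates. The word list and the relevance values (e.g. a high association between "Kyoto" and "Gion Festival") are illustrative assumptions.

```python
import math

WORDS = ["Kyoto", "Gion Festival", "Tokyo"]
# A[i][j]: degree of association of word i seen from word j
# (each column is the relevance vector of one word); values are illustrative.
A = [
    [1.0, 0.8, 0.1],   # Kyoto
    [0.8, 1.0, 0.1],   # Gion Festival
    [0.1, 0.1, 1.0],   # Tokyo
]

def transform(w):
    # matrix-vector product: convert a weighted word vector into the
    # oblique coordinate system defined by the relevance vectors
    return [sum(A[i][j] * w[j] for j in range(len(w))) for i in range(len(w))]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

kyoto = [1.0, 0.0, 0.0]   # weight only on "Kyoto"
gion = [0.0, 1.0, 0.0]    # weight only on "Gion Festival"
before = dist(kyoto, gion)                      # orthogonal coordinates
after = dist(transform(kyoto), transform(gion))  # after conversion
```

Because "Kyoto" and "Gion Festival" are strongly associated, `after` is smaller than `before`: the conversion shortens the distance between weighted word groups that emphasize highly related words.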
  • FIG. 1 is an explanatory diagram showing an outline of a sentence unit search method according to the present invention.
  • FIG. 2 is a block diagram showing a configuration of a search system using the sentence unit search device according to the first embodiment.
  • FIG. 3 is a flowchart showing the processing procedure in which the CPU of the sentence unit search device according to the first embodiment performs tagging and word extraction based on the results of morphological analysis and syntactic analysis of acquired document data, and stores the results.
  • FIG. 4 is an explanatory diagram showing an example of the contents of document data stored in the document storage means in the first embodiment.
  • FIG. 5 is an explanatory diagram showing an example of document data that the CPU of the sentence unit search device according to the first embodiment gives the result of morphological analysis and syntactic analysis and stores in the document storage means.
  • FIG. 6 is an explanatory diagram showing an example of a list of extracted words for all document data acquired by the CPU of the sentence unit search device according to the first embodiment.
  • FIG. 7 is a flowchart showing the processing procedure in which the CPU of the sentence unit search apparatus according to Embodiment 1 extracts samples from the tagged document data stored in the document storage means and performs a regression analysis to estimate the regression equation used to calculate the reference probability.
  • FIG. 8 is an explanatory diagram showing an example of a feature pattern identified by a sentence in document data stored in the document storage unit in the first embodiment.
  • FIG. 9 is a flowchart showing the processing procedure in which the CPU of the sentence unit search apparatus according to the first embodiment calculates and stores the reference probability of each word for each sentence of the tagged document data stored in the document storage means.
  • FIG. 10 is a flowchart showing the processing procedure in which the CPU of the sentence unit search device in Embodiment 1 calculates and stores the reference probability of each word for each sentence of the tagged document data stored in the document storage means.
  • FIG. 11 is an explanatory diagram showing an example in which the CPU of the sentence unit search device in Embodiment 1 sorts the document shown in the document data for each sentence.
  • FIG. 14 is an explanatory diagram showing how the set of words stored for each sentence by the CPU of the sentence unit search apparatus, and the reference probabilities calculated for those words, change as the sentences continue.
  • FIG. 15 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the first embodiment.
  • FIG. 16 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the first embodiment.
  • FIG. 17 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the first embodiment.
  • FIG. 18 is an explanatory diagram showing an example of a feature pattern specified by the CPU of the sentence unit search device according to the first embodiment for text data received from the receiving devices.
  • FIG. 19 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the second embodiment.
  • FIG. 20 is an explanatory diagram showing an outline of the influence of the manifestation of a word closely related to one word, related to the search method of the present invention in Embodiment 3.
  • FIG. 26 is an explanatory diagram showing an example of the content of a weight value representing the manifestation of each word calculated by the CPU of the sentence unit search device in the third embodiment.
  • FIG. 27 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the third embodiment.
  • FIG. 28 is a flowchart showing a processing procedure of search processing of the sentence unit search device and the reception device in the third embodiment.
  • FIG. 29 is a block diagram showing a configuration when the sentence unit retrieval method of the present invention is implemented by a sentence unit retrieval apparatus.
  • FIG. 1 is an explanatory diagram showing an outline of the sentence unit search method according to the present invention.
  • 100 in FIG. 1 represents a document set in which a plurality of document data is stored; one document 101 obtained from the document set 100 is separated into sentence units S1, ..., Sk, ....
  • 200 in FIG. 1 represents a conversation between user A and user B.
  • the conversation 200 between user A and user B is a set of time-series utterances U1, ..., Uj, ... from users A and B.
  • the conversation proceeds in the order of the utterances U1, U2, U3, ....
  • the sentence unit search method quantifies the degree of attention given to each word at the time when the user writes or utters the sentence unit or word, and assigns it to each word as a quantitative weight value,
  • uses weighted word groups, which reflect the degree of attention changing from sentence unit to sentence unit or from word to word in time series, as an index representing the contextual meaning of each sentence unit,
  • and aims to directly search for and output sentence units having a similar contextual meaning.
  • Conversation 200 in the example shown in the explanatory diagram of FIG. 1 is a conversation about travel to Kyoto between user A and user B.
  • In utterance U1 of conversation 200, "Kyoto" and "travel" appear.
  • As the conversation proceeds, "Kyoto" and "time" attract more attention than "travel", and user A and user B should be able to recognize in common that the contextual implications are shifting. Furthermore, "famous" and "festival" appear in utterance Uj. Considering utterance Uj in isolation, the words "Kyoto", "travel", "time" and "hot" do not appear in it. However, at least for user A, utterance Uj carries the meaning of a "festival" in "Kyoto" in the "summer" context. Therefore, even at the time of utterance Uj, "Kyoto" still carries weight in the contextual implications. It should also be noted that user A, who utters utterance Uj, should at least recall "Gion Festival" as a word corresponding to the festival.
  • Sentence unit Sk in this context has the meaning that, when it comes to "Kyoto" in July, it is the "Gion Festival".
  • that is, the sentence unit Sk has the meaning that it is the "Gion Festival" in "summer", in "July", in "Kyoto".
  • utterance Uj and sentence unit Sk both carry weight on "summer", "Kyoto" and "festival", and have similar contextual implications.
  • by estimating, from the preceding utterances, the group of contextual meanings that the user is aware of at the time of utterance Uj, sentence units Sk having a similar contextual meaning can be
  • directly searched for and output.
  • the computer system can present relevant information for each utterance and enter into the conversation.
  • the computer system can support the conversation between user A and user B.
  • if the computer system outputs an audio message such as "It is the Gion Festival in Kyoto in July" after utterance Uj by user A in conversation 200, a three-way conversation among user A, user B, and the computer system is realized.
  • alternatively, information such as "the Gion Festival, for Kyoto in July" is presented by the computer system, so that support of the conversation between user A and user B is also realized.
  • the computer system is made to execute the sentence unit search method according to the present invention.
  • the computer device requires pre-processing, including processing to store the document data of the document set in advance in units of sentences, and to prepare quantitative information representing the contextual meaning of each divided sentence unit.
  • search processing is also required, including processing to obtain quantitative information representing the meaning of each utterance in the conversation flow, to extract sentence units having similar meanings based on the information obtained for the utterance,
  • and to output them as search results.
  • In Embodiments 1 to 3 described below, the hardware configuration necessary for causing a computer device to execute the sentence unit search method according to the present invention is described first, and the processing by the computer device is then explained step by step, distinguishing the preprocessing from the search processing.
  • Specifically, each of Embodiments 1 to 3 describes, as an example of executing the sentence unit search method according to the present invention, a search system comprising hardware that stores a document set of document data, computer devices that accept utterances,
  • and a computer device that executes the search processing by connecting to the devices that accept utterances and to the hardware that stores the document set.
  • each process and specific example is mainly shown for the case where the document set consists of Japanese natural sentences.
  • the sentence unit search method of the present invention can be applied not only to Japanese but also to other languages.
  • for grammatical handling specific to each language, such as language analysis (morphological analysis and syntactic analysis), the most appropriate method for each language is used.
  • FIG. 2 is a block diagram showing a configuration of a search system using the sentence unit search apparatus 1 according to the first embodiment.
  • the retrieval system comprises a sentence unit retrieval device 1 that executes retrieval processing on document data, a document storage unit 2 that stores document data in natural language, a packet switching network 3 such as the Internet, and accepting devices 4, 4, ... that accept keywords or words, such as speech, input by users.
  • the sentence unit search device 1 is a PC (Personal Computer).
  • the accepting devices 4, 4, ... are also PCs, and the sentence unit retrieval device 1 is connected to the accepting devices 4, 4, ... via the packet switching network 3.
  • the sentence unit search apparatus 1 stores document data including a sentence unit to be searched in the document storage unit 2 in advance.
  • the sentence unit search device 1 classifies the document data stored in the document storage means 2 into sentence units in advance, and stores quantitative information representing contextual meaning with each sentence unit so that it can be searched later.
  • the receiving devices 4, 4,... Convert the received words into text data or voice data that can be processed by a computer, and transmit the data to the sentence unit searching device 1 via the packet switching network 3.
  • the sentence unit retrieval device 1 extracts one or more sentence units consisting of sentences from the document data stored in the document storage means 2 based on the received word data, and outputs the extracted sentence units via the packet switching network 3
  • to the receiving devices 4, 4, ..., whereby a search in units of sentences is realized.
  • the sentence unit search device 1 includes at least a CPU 11 that controls the various hardware, an internal bus 12 that connects the various hardware, storage means 13 comprising nonvolatile memory, a temporary storage area 14 comprising volatile memory,
  • communication means 15 for connection to the packet switching network 3, document set connection means 16 for connection to the document storage means 2, and auxiliary storage means 17 that uses a portable recording medium 18 such as a DVD or CD-ROM.
  • the storage means 13 stores a control program 1P, acquired from a portable recording medium 18 such as a DVD or CD-ROM, for the PC to operate as the sentence unit search device 1 according to the present invention.
  • the CPU 11 reads out and executes the control program 1P from the storage means 13, and controls the various hardware via the internal bus 12.
  • the temporary storage area 14 stores information temporarily generated by the arithmetic processing of the CPU 11.
  • the CPU 11 detects that word data transmitted from the accepting devices 4, 4, ... has been received via the communication means 15, and executes search processing based on the received word data. Further, the CPU 11 can acquire document data stored in the document storage unit 2 through the document set connection unit 16, and can store document data in the document storage unit 2 through the same unit.
  • the control program 1P, stored in the storage means 13 after being obtained from the portable recording medium 18 such as a DVD or CD-ROM via the auxiliary storage means 17, enables natural language analysis such as morphological analysis and syntactic analysis to be executed on document data expressed as character strings, based on dictionary information also stored in the storage means 13.
  • the accepting devices 4, 4, ... each include at least a CPU 41 that controls the various hardware, an internal bus 42 that connects the various hardware, storage means 43 comprising nonvolatile memory, a temporary storage area 44 comprising volatile memory, operation means 45 such as a mouse or keyboard, display means 46 such as a monitor, voice input/output means 47 such as a microphone and a speaker, and communication means 48 for connection to the packet switching network 3.
  • the storage means 43 stores a processing program for the PC to operate as the accepting devices 4, 4,.
  • when the CPU 41 reads the processing program from the storage means 43 and executes it, the CPU 41 controls the various hardware via the internal bus 42.
  • in the temporary storage area 44, information temporarily generated by the arithmetic processing of the CPU 41 is stored.
  • the CPU 41 can detect a character string input operation from the user via the operation means 45 and store the input character string in the temporary storage area 44.
  • the CPU 41 detects voice input from the user via the voice input/output means 47, and can convert it into text data by reading and executing the voice recognition program stored in the storage means 43. Further, the CPU 41 can take in the voice input by the user, through the voice input/output means 47, as voice data that can be processed by a computer.
  • the CPU 41 transmits text or voice word data obtained by detecting a character string input operation or voice input from the user to the sentence unit search device 1 via the communication means 48.
  • although the CPU 41 may convert voice data into text data and transmit it, the CPU 41 may also send features of the voice data obtained by voice recognition, for example the speed at which the phonemes corresponding to each word were uttered, or the frequency of the phonemes corresponding to each word.
  • the CPU 41 may also store the time at which the voice data corresponding to each word was accepted, and send to the sentence unit search device 1 the time difference from the point at which the word was included in a previously accepted word.
  • as pre-processing, the sentence unit search apparatus 1 first prepares the document set and processes it so that a group of meanings can later be represented for each sentence unit included in each document data. In "2. Document data acquisition and language analysis" below, the following processing is described: the sentence unit search device 1 stores the document data in the document storage means 2, separates each document data into sentence units consisting of one or more sentences, analyzes the grammatical characteristics of each sentence, and stores the results in the document storage means 2 for each sentence unit. In the first embodiment, the case where the sentence unit search device 1 treats one sentence as one sentence unit is described.
  • the CPU 11 of the sentence unit search device 1 stores document data including the sentence unit to be searched in the document storage unit 2 in advance.
  • the CPU 11 of the sentence unit search device 1 acquires the document data that can be acquired via the communication unit 15 and the packet switching network 3 by Web crawling, and stores it in the document storage unit 2 via the document set connection unit 16.
  • the CPU 11 of the sentence unit search device 1 classifies the document data acquired and stored in the document storage means 2 via the document set connection means 16 into sentence units, and performs language analysis (morphological analysis and syntactic analysis). ) And store the result in association with each sentence unit.
  • FIG. 3 is a flowchart showing the processing procedure in which the CPU 11 of the sentence unit search apparatus 1 according to the first embodiment performs tagging and word extraction from the results of the morphological analysis and syntactic analysis processing on the acquired document data, and stores the results.
  • the processing shown in the flowchart of FIG. 3 corresponds to the processing of extracting the words that appear in each sentence unit, or that are referred to from preceding sentence units, together with the features of each word in each sentence unit, and storing them.
  • The CPU 11 determines whether or not it has acquired document data (step S11). When the CPU 11 determines that it has not acquired document data (S11: NO), the CPU 11 returns the process to step S11 and waits until document data is acquired. When the CPU 11 determines that document data has been acquired (S11: YES), it reads each sentence from the acquired document data and determines whether the reading has succeeded (step S12).
  • the CPU 11 extracts, from the results of the morphological analysis and syntactic analysis, the words that appear in the analyzed sentence and the words in the sentence that are referred to from preceding sentences, and stores them in a list (step S14). Further, as will be described later, the CPU 11 generates tags from the analysis results (step S15), adds the tags to the read sentence, and stores it in the document storage means 2 via the document set connection means 16 (step S16).
  • the above processing is performed every time document data is acquired, and the tagged document data is stored in the document storage means 2.
  • FIG. 4 is an explanatory diagram showing an example of the contents of document data stored in the document storage means 2 in the first embodiment.
  • the document data stored in the document storage means 2 consists of HTML (HyperText Markup Language) and other text data obtained by the CPU 11 of the sentence unit search apparatus 1, via the communication means 15, from publicly accessible Web servers connected to the packet switching network 3.
  • the example shown in Fig. 4 is an excerpt of HTML data obtainable from a web page published on the Internet (http://ja.wikipedia.org/wiki/). In the following, this document example is used to explain document analysis and retrieval.
  • in the sentence reading process of step S12 shown in the flowchart of FIG. 3, the CPU 11 of the sentence unit search device 1 sorts the character strings in the acquired document data into the language unit of a sentence (sentence unit). For example, when the document data is written in Japanese, the CPU 11 may sort by the character string representing the full stop "。", and when the document data is written in English, by the character string representing the period ".".
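This sorting step can be sketched as a naive splitter on the Japanese full stop "。" or the English period "."; a real system would additionally need to handle abbreviations and other sentence-final punctuation. The function name and sample strings are illustrative.

```python
import re

# Minimal sketch: sort a character string into sentence units by splitting
# immediately after a Japanese full stop "。" or an English period ".".
def split_sentences(text):
    # zero-width lookbehind keeps the punctuation with its sentence;
    # \s* consumes any whitespace that follows it
    parts = re.split(r"(?<=[。.])\s*", text)
    return [p for p in parts if p]

japanese = "京都に行く。祇園祭を見る。"
english = "We visit Kyoto. We see the festival."
ja_units = split_sentences(japanese)
en_units = split_sentences(english)
```

For the Japanese string this yields two sentence units, each ending in "。"; the English string is likewise split into two period-terminated units.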
  • The CPU 11 of the sentence unit search device 1 performs morphological analysis, based on dictionary information, on the language unit of a "sentence", identifies the morphemes that are the minimum constituent units of the sentence, and analyzes the morpheme structure.
  • For example, in the document data shown in FIG. 4, the CPU 11 identifies morphemes by collating the character string with nouns such as "festival" and "god spirit", proper nouns such as "Kyushu", verbs such as "speak", particles such as "to" and "ha", and symbols such as "、" and "。".
  • Various techniques for morphological analysis have been proposed today, and the present invention does not limit the morphological analysis techniques.
  • The CPU 11 of the sentence unit search device 1 performs syntactic analysis to extract grammatical relationships between morphemes, using the part-of-speech information (noun, particle, adjective, verb, adverb, etc.) of each identified morpheme,
  • together with grammatical information obtained statistically about the cohesion between parts of speech, based on Japanese grammar for a Japanese sentence or English grammar for an English sentence. For example, by applying a grammar as a tree structure, the relationships between morphemes can be extracted according to the tree structure.
  • Suppose the analysis target is (adjective + noun + particle + noun).
  • It is first determined whether the analysis target applies to (adjective + noun). To that end, it is determined whether or not the first morpheme of the analysis target is an adjective. When it is determined that the first morpheme is an adjective, it is determined that the adjective is the outermost modifier in the analysis target, modifying the noun phrase that follows. In other words, the relationship (adjective + (noun phrase)) is extracted.
  • Next, it is determined whether the remaining analysis target is a (noun). Since it consists of multiple morphemes and is not a single noun, it is determined whether the remaining analysis target applies to (adjective + noun); therefore it is determined whether its first morpheme is an adjective.
  • Since the first morpheme of the remaining analysis target is not an adjective, the adjective part of (adjective + noun) is expanded to (noun + particle), and it is determined whether the remaining analysis target applies to ((noun + particle) + noun).
  • As a result, the grammatical relationship between the morphemes of the analysis target (adjective + noun + particle + noun) is extracted as [adjective + {(noun + particle) + noun}].
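The bracketing just described can be sketched as a toy procedure: each particle is first attached to the noun preceding it, and the remaining elements are then folded forward so that each modifier relates to what follows it. This is an illustrative reconstruction of the worked example, not the patent's actual parser.

```python
# Toy bracketing of a part-of-speech sequence such as
# (adjective + noun + particle + noun) -> [adjective + {(noun + particle) + noun}]
def bracket(seq):
    # step 1: attach each particle to the element immediately before it
    merged = []
    for pos in seq:
        if pos == "particle" and merged:
            merged[-1] = (merged[-1], "particle")
        else:
            merged.append(pos)
    # step 2: fold forward so every modifier relates to what follows it
    tree = merged[-1]
    for m in reversed(merged[:-1]):
        tree = (m, tree)
    return tree

tree = bracket(["adjective", "noun", "particle", "noun"])
```

Applied to (adjective + noun + particle + noun), this yields the nested structure ("adjective", (("noun", "particle"), "noun")), matching [adjective + {(noun + particle) + noun}] in the text.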
  • the method of syntactic analysis is not limited to such a method; various methods have been proposed today, as with morphological analysis, and the present invention does not limit the method of syntactic analysis.
  • the CPU 11 of the sentence unit search device 1 generates document data in which the identified morphemes and the grammatical relationships between them are represented by tags based on XML (eXtensible Markup Language), and stores it in the document storage means 2.
  • the input character string is morphologically analyzed and further syntactically analyzed, and tags indicating the part-of-speech information of each morpheme, dependency information, and so on are attached to each classified morpheme.
  • the control program 1P stored in the storage means 13 of the sentence unit retrieval apparatus 1 is configured so that the CPU 11 of the sentence unit retrieval apparatus 1 can execute such a natural language analysis method.
  • for example, phrase number 0 is annotated as (0: Kyushu (noun + proper noun + region + general, Kyushu, kyushuu) / region (noun + general, region, chihou) / north (noun + general, region, northern) / de (particle + case particle + general, de) / ha (particle + binding particle, ha) / 、 (symbol + punctuation)), with the morphemes identified and information added.
  • that is, it can be determined that the morpheme “Kyushu” is a noun, a proper noun, a noun indicating a region, sometimes used as a general noun,
  • that its basic form is “Kyushu”, and that its pronunciation is “kyushuu”.
  • the dependency information is given, for example, as (0 2, 1 2, 2 −1), so that the dependency relationships between phrases can be discriminated:
  • the phrase with number 0 depends on the phrase with number 2,
  • the phrase with number 1 depends on the phrase with number 2,
  • and the phrase with number 2, having no dependency destination, can be identified as the head by its destination of −1.
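Decoding such dependency information can be sketched as follows; the pair list mirrors the (0 2, 1 2, 2 −1) example above, while the function name and return shape are illustrative assumptions.

```python
# Hypothetical sketch: decode dependency information of the form
# (0 2, 1 2, 2 -1): phrase 0 depends on phrase 2, phrase 1 depends on
# phrase 2, and phrase 2 (the head) has no dependency destination (-1).
def decode_dependencies(pairs):
    heads = {}   # head phrase number -> list of dependent phrase numbers
    root = None  # phrase with destination -1 (no dependency destination)
    for phrase, target in pairs:
        if target == -1:
            root = phrase
        else:
            heads.setdefault(target, []).append(phrase)
    return heads, root

heads, root = decode_dependencies([(0, 2), (1, 2), (2, -1)])
```

Here phrase 2 is identified as the head of the sentence, with phrases 0 and 1 depending on it.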
  • FIG. 5 is an explanatory diagram showing an example of document data to which the CPU 11 of the sentence unit search apparatus 1 according to Embodiment 1 has added the results of morphological analysis and syntactic analysis and which is stored in the document storage unit 2. It corresponds to an example of the document data stored in the document storage means 2 by executing the processing procedure shown in the flowchart of FIG. 3 on the document data with the contents shown in FIG. 4.
  • as shown in FIG. 5, the CPU 11 of the sentence unit search apparatus 1 sorts part of the document shown in FIG. 4 into morphemes such as proper nouns, nouns, particles, and verbs, and expresses the relevance between them by nesting tags.
  • the example shown in Fig. 5 is based on the tagging method according to the rules proposed by GDA (Global Document Annotation; see http://i_content.org/gda). The present invention is not limited to complying with the rules. If the computer can identify morpheme information and dependency information between morphemes by information processing, the method is not limited to XML tagging.
  • The tag indicated by <su> is a tag representing a sentence unit (sentential unit).
  • For example, it can be identified by the tags that the sentence "In the northern part of the Kyushu region, events held in autumn are sometimes called (O)kunchi" consists of three clause units: "in the northern part of the Kyushu region", "events held in autumn are called (O)kunchi", and "sometimes happens".
  • The tag indicated by <ad> is a tag indicating particles other than final particles, adverbs, adjuncts, and the like.
  • The tag indicated by <n> indicates a noun, and the tag indicated by <v> indicates a verb.
  • The attribute with the attribute name syn indicates a dependency relationship between the language units, such as clauses or words, sandwiched between the tags to which the attribute is assigned.
  • The attribute value f (forward) means that the language unit depends on the nearest following language unit. Thus, in principle, clause 0 "in the northern part of the Kyushu region" relates to clause 1 "events held in autumn are called (O)kunchi", and "(O)kunchi" in turn relates to "sometimes happens" in clause 2.
  • By setting <np>, the tag indicated by <n> can be shown as not being a word on the side that receives a dependency.
  • "Northern part of the Kyushu region" can be divided into "Kyushu", "region", and "north", each enclosed in <n> tags; because "Kyushu" depends on "region" and "region" depends on "north", an explicit syn attribute is unnecessary.
  • In "events (gyouji), festivals (matsuri)", where "events" relates to "no" rather than to "festivals", the dependency relationship can be shown with <np>.
  • A proper noun representing a place, such as "Kyushu", and a proper noun representing a person's name, such as "Taro", can be indicated by the tags <placename> and <pername>, respectively.
  • A morpheme that refers to a preceding word or sentence, such as a demonstrative pronoun or a zero pronoun, can be expressed using an attribute indicating the anaphoric relationship.
  • The attribute name id can be used to indicate which preceding word or sentence a pronoun or zero pronoun refers to. For example, given the sentences "There is a button on the right side. Please press it.", a human reader can naturally supply that "it" refers to the "button". When the text is processed by a computer, however, "it" can be identified as a demonstrative pronoun by checking against dictionary information, but what it refers to cannot be determined.
  • In such a case, the corresponding relationship can be indicated by the id attribute, the eq attribute, and the obj attribute.
  • For example, the button is tagged as <np id="btn">button</np> in the sentence "There is a <np id="btn">button</np> on the right side.", and the pronoun is tagged as <np eq="btn">it</np>.
  • In this way, it can be indicated that "it" in the second sentence refers to the "button", and, by tagging the verb with the obj attribute, that the object of "press" in the third sentence is the "button".
  • Information indicating the result of morphological analysis is added, with the attribute name mph, to the attribute information of the tags such as <n>, <ad>, and <v> that sandwich each morpheme. The attribute value indicates the part-of-speech information, base form information, pronunciation information, and so on of the morpheme obtained by the morphological analysis.
  • Additional information, part-of-speech information, inflected form information, base form information, and pronunciation information are given as attribute values in the form mph="additional information; part-of-speech information; inflected form information; base form information; pronunciation information".
  • For example, for "Kyushu", the part-of-speech information can be classified as noun + proper noun + region + general, the base form is "Kyushu", and the pronunciation "Kyuushuu" can be given; these are clearly indicated by the mph attribute.
  • Identification information of the morphological analyzer, such as ChaSen, is added as the additional information of the morpheme.
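The mph attribute described above is a semicolon-delimited record of additional information, part-of-speech information, inflected form information, base form information, and pronunciation information. A minimal sketch of unpacking such a value (the field names and the sample value are assumptions for illustration, not the patent's exact data):

```python
def parse_mph(value):
    """Split an mph attribute value of the form
    'additional;part-of-speech;inflected-form;base-form;pronunciation'
    into a dict.  Field order follows the description in the text."""
    fields = ("additional", "pos", "inflected", "base", "pronunciation")
    return dict(zip(fields, value.split(";")))

# Hypothetical mph value for the morpheme "Kyushu"
info = parse_mph("chasen;noun+proper noun+region+general;;Kyushu;Kyuushuu")
```

From such a record the base form (`info["base"]`) and pronunciation (`info["pronunciation"]`) can be read off directly.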
  • The CPU 11 of the sentence unit search apparatus 1 tags the document data obtained by Web crawling with the results of morphological analysis and syntactic analysis according to the GDA rules, and stores the resulting XML data in the document storage means 2 via the document set connection means 16.
  • Thereby, the CPU 11 of the sentence unit search apparatus 1 can identify the tags of the document data by character string analysis and, by identifying the attribute information attached to the tags, can identify the morpheme information and grammatical relationships of each piece of data.
  • FIG. 6 is an explanatory diagram illustrating an example of a list of extracted words for all document data acquired by the CPU 11 of the sentence unit search device 1 according to the first embodiment.
  • In the example shown in FIG. 6, 31,245 words are listed. It should be noted that overly common words such as "thing" are excluded from the stored words. This is because such words, like conjunctions or articles, are too general: although they appear frequently, the words themselves carry little meaning, so including them would burden the search processing and they are inappropriate as search targets.
  • the CPU 11 of the sentence unit search device 1 specifies information that quantitatively represents a group of meanings of the sentence for each sentence in the document data stored in the document storage unit 2.
  • Information that quantitatively expresses the meaning of a sentence represents the group of words to which the user is paying attention when using the sentence (speaking, writing, listening, or reading), expressed as a value (a word weight value) that quantitatively indicates the degree of attention, i.e., the salience, of each word.
  • The degree of attention to each word in a sentence could also be quantified from the appearance frequency used by conventional search services. However, the appearance frequency is obtained based on a document or on the entire document set. Therefore, although calculating the appearance frequency of each word for each document can quantitatively represent the meaning of the document as a whole, it cannot represent a group of meanings that reflects the context, which changes dynamically according to the flow within the document.
  • The salience of a word in a sentence can be expressed grammatically by the degree of attention the word received in the preceding sentence and by how that degree of attention transitions in the current sentence, depending on how the word is used. In other words, if a word that was the topic (subject) in the preceding sentence is also the topic (subject) in the current sentence, that word is the most noticeable in the current sentence and has high salience. On the other hand, a word that did not appear in the preceding sentence but is the topic (subject) of the current sentence is attracting attention in the current sentence, but its salience can be said to be lower than that of a word continuously used as the topic as described above.
  • Such transitions of salience have been studied as Centering Theory (Grosz et al., 1995; Nariyama, 2002; Poesio et al., 2004).
  • However, in Centering Theory the salience of each word is not represented as a feature value suitable for quantitative calculation by a computer; it is only possible to determine which of the transitions defined by the theory the transition of each word belongs to. Therefore, the present invention quantitatively calculates the salience of each word in each sentence as follows.
  • Specifically, the reference probability of each word is calculated for each sentence, and the calculated reference probability is assigned as a weight value representing the salience of each word in each sentence.
  • The reference probability that a word appears or is referenced in a subsequent sentence is calculated as follows: rather than featurizing the meaning of the word, which is difficult to handle quantitatively, a feature pattern consisting of features of the word's appearance or reference pattern that the sentence unit search apparatus 1 can analyze by information processing is identified, and the proportion of words having the same feature pattern as the identified feature pattern that actually appear or are referenced in subsequent sentences is calculated as the reference probability.
  • the reference probability for each word is defined as a weight value for each word, and each weight value is assigned.
  • A set of words in a sentence to which such weight values have been given is called a weighted word group. In this way, the group of meanings of each sentence unit can be expressed by a weighted word group to which quantitative weight values called reference probabilities are given.
  • If the number of occurrences of the same feature pattern as an identified feature pattern were large enough, the reference probability could be calculated statistically without difficulty as the proportion of those occurrences in which the word actually appears or is referenced in a subsequent sentence. In practice, however, the number of identical feature patterns is limited, and an enormous amount of document data would be required to calculate reliable reference probabilities. Therefore, a regression equation that predicts, from the feature pattern of a word, whether the word appears or is referenced in a subsequent sentence is obtained by training a regression model on feature patterns paired with the events of actually appearing or being referenced in subsequent sentences.
  • Sentence units in the document data stored in the document storage means 2 are sandwiched between the tags indicated by <su>, and the words that appear in a sentence, or the words that have an anaphoric relationship with a pronoun or zero pronoun in the sentence, can be identified by the tags' attribute information. Therefore, in the sentence unit search apparatus 1 of the present invention, the feature pattern is identified as follows for the document data stored in the document storage means 2.
  • A sample (s, w) is a pair of one sentence s in the document data and a word w included in a sentence preceding that sentence in the document data. The feature pattern f(s, w) for a sample is specified by the following feature quantities.
  • Examples are the distance (dist) from the sentence in which the word w most recently appeared or was referenced to the sentence s, the grammatical feature (gram) indicating how the word w was used when it appeared or was referenced, and the number of times (chain) the word w appears or is referenced in the sentences preceding the sentence s.
  • The feature quantities are not limited to these; they may include whether or not the word w is a word indicating a recent topic, or whether or not the word w denotes a person.
  • Here, the results of morphological analysis and syntactic analysis are described by tags conforming to GDA, so character string analysis of the document data makes it possible to separate and count the sentences delimited by the <su> tag, to identify particles based on the part-of-speech information indicated by the tags within each sentence, and to count the number of appearances of words, including those referred to by demonstrative pronouns or zero pronouns. Therefore, the CPU 11 of the sentence unit search apparatus 1 can specify the feature quantities dist, gram, and chain for each sample by analyzing the tags and their attribute values according to GDA.
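As a rough illustration of these feature quantities (the patent computes them from the GDA tags; in the sketch below sentences are simplified to word lists, the gram feature is omitted because it needs the grammatical tags, and the exact definitions of dist and chain are assumed from the description above):

```python
def features(sentences, word, i):
    """Compute (dist, chain) for `word` relative to sentence index i.

    dist:  number of sentences back to the most recent appearance of
           `word` in sentences[0..i-1] (None if it never appeared).
    chain: number of preceding sentences in which `word` appears.
    These definitions are assumptions for illustration; the patent's
    gram feature (grammatical usage) would need the GDA tags.
    """
    dist = None
    chain = 0
    for j in range(i - 1, -1, -1):
        if word in sentences[j]:
            chain += 1
            if dist is None:
                dist = i - j
    return dist, chain

# Toy context: three sentences reduced to their content words
sents = [["Kyushu", "region"], ["festival"], ["Kyushu", "festival"]]
```

For the third sentence (index 2), "Kyushu" last appeared two sentences back and in one preceding sentence, giving (dist, chain) = (2, 1).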
  • Next, a description will be given of the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 extracts samples from the tagged document data stored in the document storage means 2, obtains feature quantities from the extracted samples to identify feature patterns, and estimates, by regression analysis, the regression equation for calculating the reference probability from the feature pattern of an extracted sample.
  • FIG. 7 is a flowchart showing the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 in Embodiment 1 extracts samples from the tagged document data stored in the document storage means 2 and estimates, by regression analysis, the regression equation for calculating the reference probability.
  • The processing shown in the flowchart of FIG. 7 corresponds to the processing of identifying a feature pattern for each sentence unit and performing regression learning to calculate the reference probability based on the feature pattern and the result of determining whether or not the identified word appears or is referenced in subsequent sentence units.
  • the CPU 11 of the sentence unit search device 1 acquires tagged document data from the document storage means 2 via the document set connection means 16 (step S21).
  • The CPU 11 identifies the <su> tags added to the acquired document data by character string analysis and separates it into sentence units (step S22).
  • Furthermore, the CPU 11 identifies each tag within <su> indicating the sentence unit by character string analysis, and extracts samples by associating each word that appears or is referenced in the sentence with the sentence (step S23).
  • Then, for each extracted sample, tags are identified by character string analysis, and the feature pattern consisting of dist, gram, and chain is specified (step S24).
  • The CPU 11 determines whether or not the separated sentence is the end of the acquired document data (step S25); if the CPU 11 determines that the separated sentence is not the end of the document data (S25: NO), the CPU 11 returns the processing to step S22 and continues the processing of identifying <su> tags in the subsequent text and separating sentences. Whether the separated sentence is the end of the acquired document data is determined, for example, by whether or not another <su> tag follows the <su></su> pair that encloses the currently separated sentence; if it is determined that none follows, the sentence can be determined to be the end.
  • If the CPU 11 determines that the separated sentence is the end of the document data, it next determines whether or not extraction of a predetermined number of samples has been completed (step S26). If the CPU 11 determines that sample extraction has not been completed (S26: NO), the CPU 11 returns the processing to step S21, acquires different tagged document data, and continues sample extraction.
  • If the CPU 11 determines that sample extraction has been completed (S26: YES), the CPU 11 performs regression analysis on the extracted samples, estimates the regression coefficients for the feature quantities dist, gram, and chain (step S27), and ends the processing.
  • FIG. 8 is an explanatory diagram showing an example of a feature pattern identified for a sentence in the document data stored in the document storage means 2 according to the first embodiment. The feature pattern f(s, Taro-kun) of the sample (s, Taro-kun), consisting of the sentence s shown in FIG. 8 and the word "Taro-kun" appearing in a preceding sentence, is identified as follows. For example, the distance feature quantity (dist) is determined by the sentence in which the word "Taro-kun" most recently appeared or was referenced in the preceding text; here, that sentence immediately precedes the current sentence s.
  • Next, the regression analysis in step S27 shown in the flowchart of FIG. 7 will be described. In the present embodiment, the regression analysis is performed based on a logistic regression model. However, the regression analysis is not limited to this; other regression analysis methods, such as a kNN (k-Nearest Neighbors) smoothing + SVR (Support Vector Regression) model, may be used.
  • When the kNN smoothing + SVR model is used, the regression model can be learned using the following eight elements as the feature quantities of the feature pattern: in addition to dist, gram, and chain described above, the following five elements can be handled as feature quantities.
  • One may be the type of noun (exp; pronoun: 1 / non-pronoun: 0) by which the word w was referenced within the preceding sentence unit.
  • Another may be whether or not the word w was the topic when it appeared or was referenced in the preceding sentence unit (last-topic; yes: 1 / no: 0).
  • Another may be whether or not the word w was the subject when it appeared or was referenced in the preceding sentence unit (last-sbj; yes: 1 / no: 0).
  • Another may be whether or not the word w denotes a person (pi; yes: 1 / no: 0) in the sample (s, w).
  • Another may be the part-of-speech information (pos; noun: 1, verb: 2, etc.) of the word w when it most recently appeared or was referenced in a preceding sentence unit.
  • Another may be whether or not the word w is referenced in the title or a heading in the document (in_header; yes: 1 / no: 0).
  • Furthermore, when the document data originates from speech, feature quantities of the voice data can also be used: for example, the elapsed time since the most recent reference location of the word (time-dist), the speaking speed per syllable of the phrase containing the most recent reference location of the word, as a ratio to the speaker's average (syllable-speed), and the frequency ratio between the lowest and highest utterance pitch of the phrase containing the reference location closest to the word (pitch-fluct); any one or more of these can be used.
  • Even when the regression analysis is performed with feature quantities of voice data in this way, the CPU 11 of the sentence unit search apparatus 1 can calculate the reference probability from those feature quantities when it receives voice data as the input words, as will be described later. As described above, when the kNN smoothing + SVR model is used, the reference probability can be calculated based on more detailed feature quantities, and a more precise reference probability can be obtained.
  • For each extracted sample (s_i, w), whether or not the word w actually appears or is referenced in the sentence s_(i+1) following the sentence s_i is determined, and regression analysis with the logistic regression model is performed on all samples (s, w). As a result, a regression equation is obtained for calculating the probability Pr(s_(i+1), w) that the word w appears or is referenced in s_(i+1) when the feature quantities dist, gram, and chain are given.
  • The probability obtained by the logistic regression model is generally given, for explanatory variables (feature quantities) x1, x2, …, xn, by the following equation (1): Pr = 1 / (1 + exp(-(b0 + b1·x1 + b2·x2 + … + bn·xn))).
  • In the regression analysis for the reference probability of a word w in a sentence s calculated by the present invention, the explained variable is set to 0 for samples in which the word does not appear and is not referenced in the subsequent sentence, and to 1 for samples in which it appears or is referenced; the explanatory variables are the feature quantities dist, gram, and chain; and the extracted samples are learned to estimate the parameters (regression coefficients) b0, b1, b2, and b3 in the following equation (2). Equation (3), in which these estimated parameters are applied, is the regression equation for obtaining the reference probability: Pr(dist, gram, chain) = 1 / (1 + exp(-(b0 + b1·dist + b2·gram + b3·chain))).
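Equations (1) to (3) share the standard logistic form. As a minimal sketch (the coefficient values below are illustrative only; the actual parameters b0 to b3 are estimated from the stored document set), the reference probability of equation (3) can be computed as:

```python
import math

def reference_probability(dist, gram, chain, b0, b1, b2, b3):
    """Logistic regression estimate of the probability that a word
    appears or is referenced in the following sentence, per Eq. (3):
    Pr = 1 / (1 + exp(-(b0 + b1*dist + b2*gram + b3*chain)))."""
    z = b0 + b1 * dist + b2 * gram + b3 * chain
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative coefficients only -- not the patent's estimated values.
p = reference_probability(dist=1, gram=1, chain=2, b0=-2.0, b1=-0.5, b2=1.0, b3=0.8)
```

With a negative coefficient on dist, a word referenced further back receives a lower reference probability, which matches the intuition that salience decays with distance.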
  • Note that the estimated parameters differ depending on whether the document data stored in the document storage means 2 consists only of newspaper articles, which are written language, or only of utterances converted to document data, which are spoken language; the estimated parameter values b0, b1, b2, and b3 also differ depending on the amount and content of the document data.
  • Therefore, it is desirable that document data be stored separately for written language and spoken language, that parameters also be estimated by regression analysis for the spoken-language document data, and that the corresponding regression equation for calculating the reference probability be stored. If the words accepted by the accepting devices 4, 4, … are limited to written language entered by text input rather than speech, the document storage means 2 may store the document data without distinguishing spoken from written language.
  • As described above, the CPU 11 of the sentence unit search apparatus 1 can calculate the reference probability of a word having a given feature pattern by specifying the feature pattern consisting of the feature quantities dist, gram, and chain of each word in a sentence unit. That is, the CPU 11 can calculate the reference probability of each word extracted for each sentence unit by specifying its feature quantities dist, gram, and chain. Therefore, the CPU 11 acquires the tagged document data stored in the document storage means 2, separates it into sentences, specifies a feature pattern for each word that appears or is referenced in each sentence, and calculates its reference probability. As a result, the group of meanings of each sentence, reflecting the contextual meaning of the preceding sentences, can be represented quantitatively.
  • The CPU 11 of the sentence unit search apparatus 1 acquires the document data stored in the document storage means 2 and, for each sentence included in the document data, identifies the feature pattern of each word based on the grammatical usage of the word in the sentence and the preceding sentences, calculates the reference probability of each word for each sentence from the identified feature pattern and the regression equation, and stores it in advance.
  • The CPU 11 of the sentence unit search apparatus 1 stores, in association with each sentence unit, the set of pairs of each word and its reference probability (the weighted word group). That is, the CPU 11 performs this storing processing for all sentences of all documents acquired from the document set. Meanwhile, in the later search processing, the CPU 11 extracts, from among all sentences of all documents, the sentences whose contextual meaning is similar to the accepted words. In that case, reading out all sentences of all documents one by one, together with the weighted word group representing the contextual meaning associated with each, would impose a heavy processing load.
  • Therefore, so that the CPU 11 of the sentence unit search apparatus 1 does not have to read out the weighted word group representing the contextual meaning of the preceding sentences for every sentence one by one in the subsequent processing, the weighted word groups calculated for each sentence are converted into a database and indexed.
  • FIG. 9 and FIG. 10 are flowcharts showing the processing procedure by which the CPU 11 of the sentence unit search apparatus 1 in Embodiment 1 calculates the word reference probabilities for each sentence of the tagged document data stored in the document storage means 2 and stores them so that they can be extracted later.
  • The processing shown in the flowcharts of FIGS. 9 and 10 corresponds to the processing of calculating, for each sentence unit, the reference probability of each word using the feature pattern identified for the word and the regression coefficients corresponding to the feature pattern, and storing the calculated reference probabilities in pairs with the words.
  • The CPU 11 of the sentence unit search apparatus 1 acquires the tagged document data from the document storage means 2 via the document set connection means 16 (step S301).
  • The CPU 11 identifies the <su> tags added to the acquired document data by character string analysis and separates it into sentence units (step S302).
  • Furthermore, the CPU 11 identifies each tag within <su> indicating the sentence unit by character string analysis, extracts the words that appear in the sentence or the words that are referenced in the sentence (step S303), and stores the extracted words in the temporary storage area 14 while the reference probabilities for the document data are being calculated (step S304).
  • Next, for each word of the document data stored in the temporary storage area 14, the CPU 11 identifies, by character string analysis, the tags added to the word, and identifies the feature pattern consisting of dist, gram, and chain (step S305). Next, the CPU 11 calculates the reference probability by substituting each feature quantity of the identified feature pattern into equation (3) (step S306).
  • The CPU 11 determines whether or not the reference probability of each word for the sentence has been calculated for all the words stored in the temporary storage area 14 (step S307). If the CPU 11 determines that reference probabilities have not been calculated for all words (S307: NO), the CPU 11 returns the processing to step S305 and continues identifying feature patterns and calculating reference probabilities for the other words. On the other hand, if the CPU 11 determines that the reference probabilities have been calculated for all the words (S307: YES), the CPU 11 stores the set of the words stored in the temporary storage area 14 and the reference probability calculated for each word (the weighted word group) by adding it as a salience attribute (step S308). At this time, the CPU 11 narrows down the words by a predetermined reference probability value and does not store words whose reference probability is less than the predetermined value.
  • Next, the CPU 11 indexes the set of words and per-word reference probabilities (the weighted word group) attached to the current sentence and stores it in a weighted word group database so that it can be extracted later (step S309). The CPU 11 may store the database in the storage means 13, or may store it in the document storage means 2 via the document set connection means 16.
  • The CPU 11 executes, for example, the following processing as one form of the indexing processing. The CPU 11 focuses on the reference probability of one word in the weighted word group obtained in step S308 and determines whether or not the reference probability of that word is greater than or equal to a predetermined value. Next, the CPU 11 determines whether or not the reference probability of another word in the weighted word group is greater than or equal to a predetermined value.
  • That is, the CPU 11 first determines whether the calculated weighted word group belongs to the group in which the reference probability of the first word is greater than or equal to the predetermined value or to the group in which it is less than the predetermined value; then, within the group to which it belongs, the CPU 11 determines whether the weighted word group belongs to the group in which the reference probability of another word is greater than or equal to the predetermined value or to the group in which it is less than the predetermined value.
  • The CPU 11 determines, by repeating such processing, to which group the calculated weighted word group belongs, and stores it in association with the identification information of the group to which it belongs. For example, a k-d tree search algorithm can be applied to this indexing processing.
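The repeated above-/below-threshold grouping described above can be sketched as building a binary group signature per weighted word group, in the spirit of the k-d tree indexing the text says can be applied (a simplification; the vocabulary, threshold, and group-identifier scheme below are assumptions for illustration):

```python
def group_id(weights, words, threshold=0.1):
    """Assign a weighted word group to a group by testing, word by word,
    whether its reference probability is >= threshold.  The resulting
    bit string serves as the group's identification information
    (1 = at or above threshold)."""
    return "".join("1" if weights.get(w, 0.0) >= threshold else "0" for w in words)

def insert(index, sentence_id, weights, words):
    """Store the sentence unit under its group's identification information."""
    index.setdefault(group_id(weights, words), []).append(sentence_id)

# Toy vocabulary and two weighted word groups (word -> reference probability)
vocab = ["Kyushu", "festival", "autumn"]
index = {}
insert(index, "doc1:s3", {"Kyushu": 0.238, "festival": 0.05}, vocab)
insert(index, "doc2:s1", {"Kyushu": 0.30, "autumn": 0.2}, vocab)
```

A later search then only needs to inspect the bucket whose signature matches (or is near) the signature of the query's weighted word group, instead of reading every sentence unit.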
  • The CPU 11 determines whether or not the processing of associating a weighted word group with each sentence has been completed for all sentences in the document data acquired in step S301 (step S310). Whether the processing has been completed for all sentences is determined, for example, by whether or not another <su> tag follows the <su></su> pair that encloses the current sentence; if it is determined that none follows, the current sentence can be determined to be the end. If the CPU 11 determines that the processing of associating a weighted word group with each sentence has not been completed for all sentences in the document data acquired in step S301 (S310: NO), the CPU 11 returns the processing to step S302 and continues the processing for the next sentence.
  • If the CPU 11 determines that the processing of associating a weighted word group with each sentence has been completed for all sentences in the document data acquired in step S301 (S310: YES), the CPU 11 deletes the words extracted from that document data and stored in the temporary storage area 14 (step S311).
  • The CPU 11 determines whether or not the processing of storing the words and the word reference probabilities by the salience attribute has been completed for all document data (step S312). If the CPU 11 determines that this processing has not been completed for all document data (S312: NO), the CPU 11 returns the processing to step S301, acquires other document data, and continues the processing. If the CPU 11 determines that the processing has been completed for all document data (S312: YES), the CPU 11 ends the processing of calculating the word reference probabilities and storing them in advance.
  • FIG. 11 is an explanatory diagram showing an example in which the CPU 11 of the sentence unit search apparatus 1 according to the first embodiment separates the document indicated by the document data into sentences. By the processing of step S301 and step S302, the CPU 11 identifies the <su> tags in the document data stored in the document storage means 2 and separates the document into sentences, for example s1 "A festival is a ritual that enshrines spirits, etc.", followed by s2, s3, ….
  • The words extracted from the sentences s1, s2, and s3 by the processing of step S303 by the CPU 11 of the sentence unit search apparatus 1 are the words stored in the word list.
  • By the processing of step S305, the CPU 11 of the sentence unit search apparatus 1 specifies, for each extracted word, the feature pattern consisting of the feature quantities dist, gram, and chain in the sentence. For example, for "Kyushu" (identification number: 9714; see FIG. 6) in a sentence s, the feature pattern is specified as follows.
  • The CPU 11 of the sentence unit search apparatus 1 calculates the reference probability by substituting the values of the feature quantities dist, gram, and chain into equation (3) by the processing of step S306 in the flowcharts of FIGS. 9 and 10. By equation (4), the reference probability of "Kyushu" in the sentence s is calculated as 0.238.
  • The calculated reference probability is stored for the sentence s by the CPU 11 of the sentence unit search apparatus 1. Specifically, each word is represented by the identification number stored in the list, and the reference probability is stored in association with it.
  • In the first embodiment, the attribute name salience is defined for the <su> tag that delimits sentence units, and its attribute value, defined as a list of word identification numbers and reference probabilities, stores the words and their reference probabilities (the weighted word group).
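The salience attribute thus stores pairs of word identification numbers and reference probabilities on the <su> tag. A hypothetical round-trip of such an attribute value (the exact serialization syntax is not given in this excerpt, so the separator characters below are assumptions):

```python
def format_salience(pairs):
    """Serialize (word_id, probability) pairs into an attribute value,
    e.g. '9714:0.238 9716:0.1159'.  The ':' and ' ' separators are
    assumptions for illustration."""
    return " ".join(f"{wid}:{p}" for wid, p in pairs)

def parse_salience(value):
    """Inverse of format_salience: recover the weighted word group."""
    out = []
    for item in value.split():
        wid, p = item.split(":")
        out.append((int(wid), float(p)))
    return out

# Pairs taken from the example in the text: "Kyushu" (9714), "Northern Kyushu" (9716)
attr = format_salience([(9714, 0.238), (9716, 0.1159)])
```

Such a compact attribute value lets the weighted word group be read back for a sentence unit without re-running the regression computation.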
  • FIG. 12 is an explanatory diagram showing an example of document data to which the CPU 11 of the sentence unit search apparatus 1 according to the first embodiment has added the results of calculating the reference probabilities and which it stores in the document storage means 2.
  • In the example shown in FIG. 12, the reference probability (the weight value in the sentence s; the same applies hereinafter) of "Kyushu" (9714) is stored as 0.238, the reference probability of "Northern Kyushu" (9716) is stored as 0.1159, and so on.
  • FIG. 13 is an explanatory diagram showing an example of the contents of the database in which the CPU 11 of the sentence unit search apparatus 1 according to Embodiment 1 indexes and stores the weighted word groups calculated for each sentence unit. The content example in FIG. 13 is associated with the sentence s shown in the content example of FIG. 12.
  • As shown in FIG. 13, the CPU 11 stores each weighted word group in association with information (a k-d tree node ID) indicating the group to which it belongs. Further, at that time, the CPU 11 stores the file name of the tagged document data and tag information identifying the position within the document data, so that each weighted word group is associated with the sentence unit of the corresponding document data. This makes it easy to later extract the sentence units associated with weighted word groups similar to the weighted word group obtained for the accepted words.
  • FIG. 14 is an explanatory diagram showing how the set of words stored for each sentence by the CPU 11 of the sentence unit search apparatus 1 and the reference probabilities calculated for those words change as sentences continue. In the example shown in FIG. 14, the context continues in time series as sentence s1, sentence s2, sentence s3, and sentence s4 follow one another.
  • The search processing starts when words, such as keywords or speech, input by the user are accepted by the accepting devices 4, 4, ….
  • The CPU 41 of the accepting device 4 can detect a character string input by the user via the operation means 45 and store it in the temporary storage area 44, or detect a voice input by the user via the voice input/output means 47, convert it into a character string, and store it in the temporary storage area 44.
  • The CPU 41 of the accepting device 4 has a function of analyzing the character string input by the user and separating it into individual sentences. For example, a predetermined character, such as the Japanese period “。” or the English period “.”, may be identified and used to separate sentences.
  • Alternatively, each time pressing of the Enter key is detected via the operation means 45, the character string entered up to that point may be separated as one sentence.
  • For speech input, the voice may be converted into a character string by the speech recognition function and separated into sentences by analyzing the converted character string.
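The sentence-separation rule described above can be sketched as follows; the function name and the splitting regular expression are illustrative assumptions, not the accepting device's actual implementation.

```python
import re

def split_sentences(text):
    """Split input into sentences at the Japanese period "。" or the English
    period "." - a minimal sketch of the separation rule described above."""
    # Lookbehind keeps the period with its sentence; \s* absorbs spacing.
    parts = re.split(r'(?<=[。.])\s*', text)
    return [p for p in parts if p]

print(split_sentences("今日は晴れ。明日は雨。"))   # ['今日は晴れ。', '明日は雨。']
print(split_sentences("It rains. It stops."))  # ['It rains.', 'It stops.']
```

A production segmenter would also have to handle abbreviations, decimal points, and the Enter-key rule mentioned above; this sketch covers only the period-based case.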
  • The CPU 41 of the accepting device 4 transmits the separated sentences, one sentence at a time, as text data to the sentence unit retrieval device 1 via the communication means 48.
  • Next, the processing performed when the CPU 11 of the sentence unit search device 1 receives text data indicating the words accepted by the accepting devices 4, 4, … and searches for sentence units in the documents stored in the document storage means 2 will be described.
  • For the text data indicating the accepted words, quantification of the group of meanings is performed; that is, words are extracted from the text data and their reference probabilities are calculated.
  • In this way, information indicating a group of meanings, reflecting the context flowing from the preceding words in the user's latent consciousness when the user inputs words, can be created automatically and used as a search request in the search processing described later.
  • The temporary storage area 14 stores the text data in the order received, and morphological analysis and syntactic analysis are performed on the sentences indicated by the received text data.
  • The CPU 11 of the sentence unit retrieval apparatus 1 identifies the feature pattern f(s, w) of each word w in a sentence s of the received text data, and calculates the reference probability based on the identified feature pattern and the regression equation obtained in advance.
  • The CPU 11 of the sentence unit search device 1 calculates a reference probability for each word, and performs a sentence-unit search by comparing the words and the reference probabilities calculated for them with the weighted word groups already stored in association with the sentence units, that is, with the stored sets of words and per-word reference probabilities.
  • the CPU 11 of the sentence unit search device 1 can receive not only text data but also speech data of utterances input by the user from the reception devices 4, 4,. In this case, the same processing is performed by specifying the grammatical feature pattern of the words expressed in the voice data as in the text data.
  • For speech data, features obtained from the speech itself can also be treated as features for determining whether a word is highly salient. For example, when a word appears or is referred to, the CPU 11 can treat the time difference from the appearance or reference of the preceding word as one feature quantity. Further, the CPU 11 can treat the speech speed and/or the voice frequency when the word was uttered in the most recent preceding words in which it appeared or was referred to as other feature quantities.
  • Next, the processing procedure in which the accepting device 4 accepts words input by the user and transmits them to the sentence unit retrieval device 1, and in which the CPU 11 of the sentence unit retrieval device 1 searches the document data stored in the document storage means 2 based on the text data received from the accepting device 4, will be described with reference to flowcharts.
  • FIG. 15, FIG. 16, and FIG. 17 are flowcharts showing the processing procedure of the search processing of the sentence unit search device 1 and the reception device 4 in the first embodiment.
  • The CPU 41 of the accepting device 4 determines whether a character string input operation by the user has been detected via the operation means 45, or whether a voice input by the user has been detected via the voice input/output means 47 (step S401). If the CPU 41 determines that neither a character string input operation nor a voice input by the user has been detected (S401: NO), the CPU 41 returns the process to step S401 and waits until a character string input operation or a voice input by the user is detected. [0193] On the other hand, if the CPU 41 of the accepting device 4 determines that a character string input operation or a voice input by the user has been detected (S401: YES), it separates the input words into single sentences from the input character string or the character string converted from the voice, stores them in the temporary storage area 44 (step S402), and transmits the input words to the sentence unit search device 1 via the packet switching network 3 (step S403).
  • The CPU 11 of the sentence unit search device 1 receives the words input by the user from the accepting device 4 (step S404). The CPU 11 stores the received words as text data in the temporary storage area 14 in the order of reception (step S405). At this time, a sentence identification number may be added to each item of text data before it is stored.
  • The CPU 11 performs morphological analysis and syntactic analysis on the stored text data (step S406), and stores the words extracted by the analysis in the temporary storage area 14 (step S407). At this time, the CPU 11 stores each word in association with its identification number in the list.
  • Through step S407, the temporary storage area 14 of the sentence unit search device 1 comes to hold the words that have appeared or been referred to at least once in the series of input words (utterances).
  • Note that the word extraction of step S407 need not always be performed; in that case, the feature pattern identification process described later is performed on all the words stored in the list.
  • The CPU 11 identifies a feature pattern for each word based on the text data received and stored in the past and on the results of the morphological analysis and syntactic analysis of step S406 (step S408). The CPU 11 substitutes the feature quantities of the identified feature pattern into a regression equation for calculating the reference probability, obtained in advance by performing regression analysis on spoken language, and calculates a reference probability for each word (step S409). The CPU 11 then determines whether reference probabilities have been calculated for all the words stored in the temporary storage area 14 (step S410). If the CPU 11 determines that reference probabilities have not yet been calculated for all the stored words (S410: NO), the process returns to step S408, and the feature pattern identification and reference probability calculation are performed for the remaining words.
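Steps S408 and S409 can be illustrated with a small sketch. The text does not state the functional form of the regression equation, so the logistic form and the coefficient values below are assumptions for illustration only; in the device, the coefficients come from the prior regression analysis on spoken language ("3-1. Regression model learning").

```python
import math

# Hypothetical coefficients: the real values come from the prior regression
# analysis on the document sample; these numbers are invented for illustration.
COEFF = {"intercept": -1.2, "dist": -0.8, "gram": 0.9, "chain": 0.6}

def reference_probability(dist, gram, chain):
    """Sketch of step S409, assuming the regression equation is logistic:
    p = 1 / (1 + exp(-(b0 + b1*dist + b2*gram + b3*chain)))."""
    z = (COEFF["intercept"] + COEFF["dist"] * dist
         + COEFF["gram"] * gram + COEFF["chain"] * chain)
    return 1.0 / (1.0 + math.exp(-z))

# A word mentioned recently (small dist), as subject (high gram), in a chain:
print(round(reference_probability(dist=0.1, gram=1.0, chain=1.0), 3))  # → 0.555
```

The point of the sketch is only the flow of step S409: identified feature quantities (dist, gram, chain) are substituted into a pre-learned equation to yield a probability between 0 and 1.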
  • Once the reference probabilities have been calculated and stored in the temporary storage area 14, the CPU 11 narrows the words down to those whose reference probability is equal to or greater than a predetermined value (step S411). Removing words with extremely low reference probabilities reduces the load that the subsequent calculations place on the CPU 11 itself.
  • The CPU 11 performs the following search processing based on the narrowed-down words and their reference probabilities, that is, based on the pairs of words and word reference probabilities that quantitatively represent the group of meanings in the flow from the previously accepted words.
  • The following search processing is an example of processing that compares the weighted word group obtained for the received words with the weighted word groups stored in advance for each sentence unit, determines whether the words and the sentence units have similar meanings based on whether the weight value distributions of the multiple words in each weighted word group are similar, and extracts similar sentence units.
  • The CPU 11 reads from the database of the storage means 13 or the document storage means 2 the pairs of words and word reference probabilities stored in association with each sentence unit (hereinafter, weighted word groups) (step S412).
  • At this time, so that the CPU 11 can narrow the reading down to weighted word groups that are at least somewhat similar, the weighted word group associated with the accepted words obtained by the processing up to step S411 is assigned to a group in the same manner as the weighted word groups stored in the database. The CPU 11 then reads from the database the weighted word groups of the group to which the weighted word group associated with the received words belongs. As a result, comparison with weighted word groups that are not similar at all is avoided, and weighted word groups that are somewhat similar can be narrowed down and extracted.
  • the CPU 11 extracts a weighted word group including the same words as the weighted word group of the received word from the weighted word group read out in step S412 (step S413).
  • The CPU 11 calculates, for each word shared with each extracted weighted word group, the difference in reference probability (step S414).
  • The CPU 11 assigns similarities to the extracted weighted word groups in descending order of the number of identical words and in ascending order of the differences in reference probability of those identical words (step S415), and reads the sentence units associated with the extracted weighted word groups from the document data of the document set (step S416).
  • the CPU 11 may read a sentence corresponding only to a weighted word group having a similarity equal to or greater than a predetermined value.
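Steps S413 to S415 above can be sketched as follows. The exact scoring formula is not given in the text, so the score below (number of shared words, discounted by the summed reference-probability differences) is one illustrative way of ranking "more identical words, smaller probability differences" higher.

```python
def similarity(query_group, stored_group):
    """Sketch of steps S413-S415: score a stored weighted word group against the
    query's weighted word group. More shared words and smaller reference-
    probability differences give a higher score (illustrative formula)."""
    shared = set(query_group) & set(stored_group)
    if not shared:
        return 0.0
    diff = sum(abs(query_group[w] - stored_group[w]) for w in shared)
    return len(shared) / (1.0 + diff)

query = {"Kyushu": 0.24, "North Kyushu": 0.12}
s1 = {"Kyushu": 0.22, "North Kyushu": 0.10, "Hakata": 0.05}  # close match
s2 = {"Kyushu": 0.02}                                        # weak match
ranked = sorted([("s1", s1), ("s2", s2)],
                key=lambda kv: similarity(query, kv[1]), reverse=True)
print([name for name, _ in ranked])  # ['s1', 's2']
```

The sentence units corresponding to the top-ranked weighted word groups would then be read out and sorted by similarity, as in steps S416 and S417.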
  • the CPU 11 sorts the extracted sentences by similarity (step S417).
  • the CPU 11 transmits text data representing each sentence as text data of the search result to the accepting device 4 via the communication means 15 (step S418).
  • The CPU 41 of the accepting device 4 receives the text data of the search results via the communication means 48 (step S419), displays the received text data on a monitor or the like via the display means 46 (step S420), and ends the process.
  • As described above, the CPU 41 of the accepting device 4 transmits text data or speech data separated into single sentences to the sentence unit searching device 1 each time an input of words from the user is detected.
  • The CPU 11 of the sentence unit search device 1 calculates words and a reference probability for each word every time it receives text data, or voice data and the information transmitted together with it, from the accepting device 4, and creates as a search request information representing the group of meanings reflecting the flow from the words preceding those received from the user, that is, a weighted word group.
  • The CPU 11 of the sentence unit search device 1 extracts sentence units from the stored document data based on the search request (weighted word group) created for the accepted words, and transmits text data as the search results.
  • the CPU 41 of the accepting device 4 in the first embodiment displays the text data of the search result on the monitor or the like each time it is received. Therefore, every time a user-spoken word is input, the reception device 4 displays text data similar in meaning to that word as a search result.
  • The receiving device 4 does not necessarily have to be configured to transmit text data each time user-spoken words are input and to receive and display a search result each time. For example, it may be configured to transmit text data or voice data corresponding to a plurality of words input during a predetermined period to the sentence unit search device 1, and to receive and display the search results corresponding to the plurality of words.
  • FIG. 18 is an explanatory diagram showing an example of feature patterns identified by the CPU 11 of the sentence unit searching device 1 according to the first embodiment for text data received from the accepting device 4. The sentence units in FIG. 18 are those indicated by the received text data.
  • Regression analysis is performed in advance on the document data stored in the document storage means 2, so that a regression equation is derived with which, once a feature pattern is identified, the reference probability can be calculated by substituting the feature quantities.
  • Therefore, the CPU 11 of the sentence unit search device 1 can calculate the reference probability for the “Snoopy” of sentence s based on the feature quantities dist, gram, and chain of the identified feature pattern. Further, the CPU 11 of the sentence unit search device 1 calculates reference probabilities for sentence s, including those of words that appeared or were referred to in the past, and obtains the words and their reference probabilities.
  • Based on the obtained words and reference probabilities, the CPU 11 of the sentence unit search device 1 directly extracts, from the sentence units whose salience attributes are stored in advance in the document storage unit 2, those sentence units in which the reference probabilities of the same words are equal to or greater than a predetermined value. The CPU 11 of the sentence unit search device 1 transmits text data indicating the extracted sentence units to the accepting device 4 via the communication means 15.
  • As described above, the meaning of the words represented by the received text data can be expressed by words and, for each word, a word reference probability (weight value).
  • For each sentence unit as well, words representing a group of meanings and their word reference probabilities are stored. Sentence units whose meanings are similar can therefore be searched for directly, based on whether the extracted words have similar reference probabilities.
  • In the second embodiment, the pair of the extracted words and the reference probabilities calculated for each word (the weighted word group) is treated as a manifestation vector. Furthermore, the pair of the words obtained for the accepted words and the reference probabilities calculated for each of them (a weighted word group) is also treated as a manifestation vector. Then, at the stage of the search processing, instead of comparing, as in the first embodiment, the weight value distributions of the plurality of words in the weighted word group of the accepted words and in the weighted word groups previously associated with each sentence unit, each weighted word group is represented by a manifestation vector, and whether the similarity condition is satisfied is determined by the shortness of the distance between the manifestation vectors.
  • The information that quantitatively represents the group of meanings of each sentence unit is expressed as the group of words the user pays attention to when using the sentence (speaking, writing, listening, or reading), together with a value (the word weight value) that quantitatively indicates the degree to which the user pays attention to each word, that is, its salience.
  • As the quantitative weight value of the manifestation (salience), a reference probability indicating the probability that the word will appear or be referred to in subsequent sentences is used.
  • The reference probability is calculated using a regression equation that includes the regression coefficients obtained by regression analysis on a sample of the document data stored in the document storage means 2, as in “3-1. Regression model learning” of the first embodiment.
  • The CPU 11 of the sentence unit search apparatus 1 identifies the feature quantities dist, gram, and chain for each extracted word, and can then calculate the reference probability of each word using the regression equation with the regression coefficients obtained by the regression analysis. A weighted word group is obtained by assigning the reference probability of each word as that word's weight value.
  • The weighted word group that represents the group of meanings of each sentence unit is treated as a vector in which each word is one dimension and the reference probability calculated for that word is the element of the corresponding dimension component. That is, the meaning of each sentence unit in the document data stored in the document storage means 2 can be represented by a vector in a multidimensional space whose dimensions are the words extracted from the document data and stored in the list shown in FIG. 6.
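The vector representation described above can be sketched with a sparse mapping: with roughly 31245 word dimensions, storing only the nonzero elements is natural. The vocabulary fragment and the word IDs below are illustrative, not the actual list of FIG. 6.

```python
def manifestation_vector(weighted_words, vocabulary_index):
    """Represent a weighted word group as a sparse vector over the word list
    (dimension = word, element = reference probability). Words absent from the
    sentence unit implicitly have element 0."""
    return {vocabulary_index[w]: p for w, p in weighted_words.items()}

# Hypothetical fragment of the 31245-word list; the IDs are invented here.
vocab = {"Kyushu": 9714, "North Kyushu": 9716, "Hakata": 9720}
vec = manifestation_vector({"Kyushu": 0.238, "North Kyushu": 0.1159}, vocab)
print(vec)  # {9714: 0.238, 9716: 0.1159}
```

Only the dimensions whose reference probability is nonzero are stored; all other dimensions of the 31245-dimensional space are implicitly 0, which keeps both storage and the later distance calculations cheap.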
  • The document data stored in the document storage means 2, together with the results of the CPU 11 of the sentence unit search apparatus 1 calculating the reference probabilities in the second embodiment, is the same as the document data shown in the explanatory diagram of FIG. 11 of the first embodiment. That is, the document data stored in the document storage means 2 holds the dimension numbers and the values of the reference probabilities that are the elements of the dimension components.
  • Since the processing procedure in which the CPU 11 of the sentence unit search apparatus 1 according to the second embodiment calculates the word reference probabilities for each sentence unit of the tagged document data stored in the document storage means 2 and stores them in the database in association with each sentence unit is the same as in the first embodiment, its explanation is omitted.
  • a process for searching for a sentence in a document stored in the document storage unit 2 when the CPU 11 of the sentence unit searching apparatus 1 receives text data indicating a word received by the receiving apparatus 4 will be described.
  • For the text data indicating the accepted words, the CPU 11 of the sentence unit search apparatus 1 likewise represents the group of contextual meanings of the accepted words as a manifestation vector indicating a direction in the multidimensional space of words.
  • For the text data received from the accepting device 4, the CPU 11 of the sentence unit search device 1 identifies, for each of the 31245-dimensional words stored in the list, the feature pattern represented by the feature quantities dist, gram, and chain. Note that for a word that has not appeared in the text data received as a series so far, the feature pattern identification is omitted and the element of the corresponding dimension component is set to 0.
  • The reference probabilities that are the elements of the dimension components can then be calculated based on the regression equation. Therefore, each time text data is received, the CPU 11 of the sentence unit search device 1 can calculate a manifestation vector representing the group of meanings in the context of the words indicated by the received text data.
  • The CPU 11 of the sentence unit search device 1 compares the manifestation vector calculated for the received words with the manifestation vectors, stored in the document storage means 2, of the sentence units to which the salience attribute was added in advance; the distance between them is calculated directly by vector operations, and sentence units at a short distance are extracted. Sentence units with a similar direction of meaning can thus be searched for in the 31245-dimensional multidimensional space in which each word in FIG. 6 is one dimension.
  • The CPU 11 of the sentence unit search device 1 transmits text data indicating the extracted sentence units to the accepting device 4 via the communication means 15. If a computer capable of vector operations is used, the meaning of each sentence unit can be handled directly as a manifestation vector.
  • FIG. 19 is a flowchart showing a processing procedure of search processing of the sentence unit search device 1 and the reception device 4 in the second embodiment.
  • The same reference numerals are used for the steps that are the same as in the processing procedure of the search processing shown in the flowcharts of FIGS. 15, 16, and 17 in the first embodiment, and their detailed description is omitted.
  • After calculating the reference probabilities of all the words stored in the temporary storage area 14, the CPU 11 of the sentence unit search device 1 narrows them down to the words for which a reference probability equal to or greater than a predetermined value was calculated (steps S408 to S411), and calculates the manifestation vector of the accepted words based on each narrowed-down word and its calculated reference probability (step S501).
  • As a result, a manifestation vector that quantitatively represents the group of meanings in the flow following the previously accepted words can be generated as a search request for the accepted words.
  • The following processing is an example of processing that compares the manifestation vector obtained for the accepted words with the manifestation vector of each sentence unit stored in advance, and determines whether the distributions of the weight values of the words represented by the two manifestation vectors are similar.
  • The CPU 11 reads the weighted word groups stored in the database, that is, the manifestation vectors (step S502). At this time, for the manifestation vector associated with the accepted words obtained by the processing up to step S411, the CPU 11 determines to which group it belongs, in the same manner as for the manifestation vectors stored in the database. The CPU 11 then reads from the database the manifestation vectors of the group to which the manifestation vector associated with the accepted words belongs. As a result, manifestation vectors with a similar distribution of weight values over the words can be narrowed down and extracted.
  • The CPU 11 calculates the distance between the manifestation vector associated with the accepted words and each read manifestation vector (step S503).
  • The CPU 11 narrows the read manifestation vectors down to those whose calculated distance is less than a predetermined value (step S504), and reads the sentence units stored in association with the narrowed-down manifestation vectors (step S505).
  • The CPU 11 assigns similarities to the read sentence units in order of increasing calculated distance (step S506).
  • Through the processing from step S501 to step S506 by the CPU 11 of the sentence unit searching apparatus 1 in the second embodiment, sentence units whose contextual meaning is similar to that of the accepted words are extracted.
  • The subsequent processing from step S417 onward for the extracted sentence units is the same as in the first embodiment.
  • The process of step S503, in which the CPU 11 calculates the distance between the manifestation vector associated with the accepted words and each read manifestation vector, is performed concretely as follows. When the manifestation vector associated with the accepted words u is represented as v(u) and the read manifestation vector as v(s), the CPU 11 calculates the cosine measure as shown in equation (5) below.
  • cos(v(u), v(s)) = v(u) · v(s) / (|v(u)| |v(s)|) … (5)
  • In step S506, the CPU 11 assigns similarities in descending order of the calculated cosine value, that is, in ascending order of cosine distance.
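Equation (5) can be sketched for sparse manifestation vectors as follows; the dimension IDs and weights below are illustrative.

```python
import math

def cosine(v_u, v_s):
    """Equation (5): cos(v(u), v(s)) = v(u)·v(s) / (|v(u)| |v(s)|), for sparse
    vectors stored as {dimension: weight}. A larger cosine value means a shorter
    cosine distance, i.e. higher similarity."""
    dot = sum(w * v_s.get(d, 0.0) for d, w in v_u.items())
    norm_u = math.sqrt(sum(w * w for w in v_u.values()))
    norm_s = math.sqrt(sum(w * w for w in v_s.values()))
    if norm_u == 0.0 or norm_s == 0.0:
        return 0.0
    return dot / (norm_u * norm_s)

v_query = {9714: 0.24, 9716: 0.12}          # accepted words
v_sent = {9714: 0.22, 9716: 0.10, 9720: 0.05}  # stored sentence unit
print(round(cosine(v_query, v_sent), 3))  # → 0.979
```

Because only the nonzero dimensions are stored, the dot product runs over the query's few dozen active words rather than all 31245 dimensions.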
  • In the third embodiment, the weight value representing the manifestation of each word is recalculated taking into account associations from other words that are deeply related to that word.
  • Here, an association refers to the case where a word in the weighted word group associated with a sentence unit does not itself appear in that sentence unit or the preceding sentence units, but is deeply related to a word that is highly salient there; this means that the word is also attracting attention in that sentence unit. Therefore, a word that easily attracts attention at the same time as a given single word attracts attention is taken as a related word, and the influence of the salience of such closely related words is reflected in the weight value representing the manifestation of each word.
  • FIG. 20 is an explanatory diagram showing an overview of how the manifestation of one word influences words closely related to it, in the search method of the present invention in the third embodiment.
  • The explanatory diagram of FIG. 20 represents an example of a conversation between users.
  • The conversation is a series of successive utterances u. In the course of such a conversation, the reference probability calculated for a word may have dropped, even though its salience should still have a high value because a word closely related to it is still being talked about.
  • Therefore, in the third embodiment, the weight value representing the manifestation of each word associated with each sentence unit or with the accepted words is recalculated in consideration of the manifestation of its related words.
  • In order to recalculate the reference probabilities into weight values that take the manifestation of related words into account, the sentence unit search device 1 must first obtain and store information representing how deeply words are related to one another. The influence of the degree of association, which represents the depth of the relation, is then reflected in the reference probability of each word calculated for each sentence unit. Specifically, using the above example, the degree of association of “America Village” with “Osaka” is first calculated quantitatively. Next, the weight value representing the manifestation of “Osaka” for each sentence unit is recalculated and stored by reflecting, through that degree of association, the influence of the reference probability of “America Village”.
  • Therefore, the sentence unit search device 1 creates, for each word, a weighted related word group in which the degree of association of every other word with that one word is given as its weight value.
  • Specifically, the sentence unit search device 1 creates the weighted related word group for each word using the weighted word groups stored in association with each sentence unit by the processing of “3-3. Quantification of manifestation per sentence unit”, that is, the pairs of words and word reference probabilities, or the manifestation vectors.
  • The sentence unit search device 1 creates and stores a weighted related word group for every word extracted from the entire document set.
  • Next, for the weighted word groups stored in association with each sentence unit, that is, for each word of the word/reference-probability pairs or of the manifestation vectors, the sentence unit search device 1 reflects the influence of the reference probabilities of closely related words in each reference probability using the degree of association, and recalculates and stores the weight value of each word.
  • In the search processing, the sentence unit search apparatus 1 similarly recalculates the weight value of each word, using the degree of association, for the weighted word group associated with the accepted words, that is, for the word/reference-probability pairs or the manifestation vector.
  • the sentence unit search device 1 performs a search process based on the word corresponding to the accepted word and the weight value recalculated for each word.
  • The related word group is created by the sentence unit search device 1 performing the following processing for every word extracted and stored in the list shown in FIG. 6.
  • First, from the weighted word groups stored in association with every sentence unit in “3-3. Quantification of manifestation per sentence unit”, the sentence unit retrieval apparatus 1 extracts the weighted word groups in which the reference probability of the one word is equal to or greater than a predetermined value. This is because, as described above, a related word is a word likely to attract attention at the same time that the one word attracts attention, so the sentence units in which the one word attracts attention are picked out.
  • Next, the sentence unit search device 1 integrates the extracted weighted word groups in which the reference probability of the one word is equal to or greater than the predetermined value. Specifically, the reference probability of each word in each weighted word group is weighted by the reference probability of the one word included in that weighted word group, and the reference probabilities of each word are then averaged. The reason for weighting by the reference probability of the one word is to give greater influence to the reference probabilities of the words in weighted word groups in which the one word has a higher reference probability.
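The extraction and integration steps just described can be sketched as follows. The threshold value (0.2, matching the later “America Village” example) and the group contents are illustrative; the final L2 normalization corresponds to step S610 in the flowcharts below.

```python
import math

def related_word_group(one_word, weighted_groups, threshold=0.2):
    """Sketch of the related-word-group construction: keep the weighted word
    groups where `one_word` has reference probability >= threshold, sum each
    word's probability weighted by that group's probability of `one_word`,
    then L2-normalize the result."""
    totals = {}
    for group in weighted_groups:
        w_one = group.get(one_word, 0.0)
        if w_one < threshold:
            continue  # the one word is not salient in this sentence unit
        for word, prob in group.items():
            totals[word] = totals.get(word, 0.0) + w_one * prob
    norm = math.sqrt(sum(v * v for v in totals.values()))
    return {w: v / norm for w, v in totals.items()} if norm else {}

groups = [
    {"America Village": 0.5, "Osaka": 0.3},
    {"America Village": 0.4, "fashion": 0.2},
    {"America Village": 0.1, "Tokyo": 0.9},   # below threshold, ignored
]
rel = related_word_group("America Village", groups)
print(max(rel, key=rel.get))  # → America Village
```

Each resulting weight is the degree of association of that word with the one word; groups where the one word was barely salient contribute nothing, and stronger groups contribute proportionally more.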
  • FIG. 21 and FIG. 22 are flowcharts showing a processing procedure in which the CPU 11 of the sentence unit search apparatus 1 according to the third embodiment creates a related word group.
  • The processing shown in the flowcharts of FIG. 21 and FIG. 22 corresponds to the process of extracting, for one word, the weighted word groups whose weight value for that word is equal to or greater than a predetermined value; the process of integrating the weight values of each word of the extracted word groups and creating a related word group in which each integrated weight value is assigned as a degree of association; the process of storing it in association with the one word; and the process of executing each of these processes for every word.
  • the CPU 11 of the sentence unit search device 1 selects one word from the list stored in the storage means 13 (step S601).
  • the CPU 11 acquires tagged document data from the document storage means 2 via the document set connection means 16 (step S602).
  • The CPU 11 identifies the <su> tags added to the acquired document data by character string analysis and extracts a sentence unit (step S603).
  • The CPU 11 reads the salience attribute stored in the <su> tag (step S604), and determines whether, in the set of words and word reference probabilities (weighted word group) stored in the salience attribute, the reference probability of the one word selected in step S601 is equal to or greater than a predetermined value (step S605).
  • If the CPU 11 determines that the reference probability is equal to or greater than the predetermined value (S605: YES), the CPU 11 stores the weighted word group read from the salience attribute in step S604 in the temporary storage area 14 (step S606).
  • The CPU 11 then determines whether the processing from step S604 to step S606 has been executed for all the sentence units of the document data acquired in step S602 (step S607). If the CPU 11 determines that the processing has not been executed for all the sentence units (S607: NO), the CPU 11 returns the process to step S603, reads the next sentence unit (S603), and executes the processing from step S604 to step S606.
  • If the CPU 11 determines that the processing has been executed for all the sentence units (S607: YES), the CPU 11 determines whether the weighted word groups in which the reference probability of the selected one word is equal to or greater than the predetermined value have been extracted from all the document data (step S608). If the CPU 11 determines that they have not been extracted from all the document data (S608: NO), the CPU 11 returns the process to step S602, acquires the next document data (S602), and executes the processing from step S603 to step S607.
  • If the CPU 11 determines in step S608 that the weighted word groups in which the reference probability of the selected one word is equal to or greater than the predetermined value have been extracted from all the document data (S608: YES), the CPU 11 integrates the set of weighted word groups extracted by the processing of step S606 and stored in the temporary storage area 14, by calculating, for each word, the sum of the weight values weighted by the reference probability of the one word (step S609).
  • the CPU 11 determines that the reference probability of one word created in step S609 is a predetermined value or more.
  • the sum of the weighted word groups, that is, the weight value of each word of the summed weighted word group is normalized (step S610).
  • The CPU 11 stores the weighted word group normalized in step S610 as a related word group for the word selected in step S601, with each weight value serving as a degree of relevance, in the storage means 13 or, via the document set connection means 16, in the document storage means 2 (step S611).
  • The CPU 11 of the sentence unit search device 1 then determines whether related word groups have been created and stored for all the words in the list stored in the storage means 13 (step S612). If the CPU 11 determines that related word groups have not yet been created and stored for all the words (S612: NO), it returns the process to step S601, selects the next word (S601), and executes the processing of steps S602 through S611 for the selected word.
  • In step S605, the CPU 11 of the sentence unit search device 1 may compare a normalized reference probability with the predetermined value, rather than simply determining whether the raw reference probability is equal to or higher than the predetermined value.
  • In that case, the CPU 11 of the sentence unit search device 1 normalizes the reference probabilities of the words associated with a sentence unit so that the sum of their squares is "1", dividing each reference probability by the square root of the sum of the squares of all reference probabilities. Likewise, in step S610, the CPU 11 performs the normalization by dividing each weight value by the square root of the sum of the squares of all weight values.
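The related-word-group creation described above (extraction in step S605, weighting and summing in step S609, normalization in step S610) can be sketched as follows. This is a minimal illustration, not the device's implementation: the function name, the threshold 0.2, and the small corpus of weighted word groups are assumptions, with the first group reusing the "America Village: 0.6" figure from the example of FIG. 23.

```python
from math import sqrt

def build_related_word_group(sentence_word_groups, target, threshold=0.2):
    """Create a related word group for `target`.

    sentence_word_groups: list of dicts mapping word -> reference probability,
    one dict per sentence unit (the stored weighted word groups).
    Returns a dict mapping word -> normalized degree of relevance.
    """
    summed = {}
    for group in sentence_word_groups:
        p = group.get(target, 0.0)
        if p < threshold:
            # step S605: skip sentence units where the target word's
            # reference probability is below the predetermined value
            continue
        for word, prob in group.items():
            # step S609: weight every word by the target's reference
            # probability in this group, then sum word by word
            summed[word] = summed.get(word, 0.0) + p * prob
    # step S610: normalize so the squared weights sum to 1 (Euclidean norm;
    # the text notes that other norms may be used as well)
    norm = sqrt(sum(v * v for v in summed.values()))
    if norm:
        summed = {w: v / norm for w, v in summed.items()}
    return summed

corpus = [
    {"America Village": 0.6, "Osaka": 0.5, "autumn": 0.0},
    {"America Village": 0.3, "shopping": 0.7},
    {"America Village": 0.1, "Tokyo": 0.9},   # skipped: 0.1 < 0.2
]
related = build_related_word_group(corpus, "America Village")
```

In this sketch the third group is discarded because the target's reference probability 0.1 is below the threshold, so "Tokyo" never enters the related word group.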
  • An example is shown below of the related word group created when the CPU 11 of the sentence unit search apparatus 1 performs the processing shown in the flowcharts of FIGS. 21 and 22 for one word.
  • FIG. 23 is an explanatory diagram showing an example of a weighted word group in the course of each process when a related word group is created by the CPU 11 of the sentence unit search apparatus 1 according to the third embodiment.
  • This is an example in which the CPU 11 of the sentence unit search device 1 extracts the weighted word groups in which the reference probability of the word “America Village” is equal to or higher than the predetermined value (0.2).
  • FIG. 23(a) shows the weighted word groups GW1, GW2, GW3 extracted by the processing of the CPU 11 in step S605 shown in the flowcharts of FIGS. 21 and 22 and stored in the temporary storage area 14. FIG. 23(b) likewise shows the weighted word groups GW1′, GW2′, GW3′ obtained by weighting each group by the processing of the CPU 11 in step S609, and FIG. 23(c) shows the weighted word group GW″ obtained by summing them.
  • First, the weighted word groups GW1, GW2, GW3, in which the weight value (reference probability) of the one word “America Village” is equal to or higher than the predetermined value 0.2, are extracted. Next, the weight value of every word in each extracted group is multiplied by the weight value (reference probability) of the one word “America Village” in that group, yielding the word groups GW1′, GW2′, GW3′. For example, because the weight value (reference probability) of “America Village” in the weighted word group GW1 is 0.6, each weight value of GW1 is multiplied by 0.6, giving the word group GW1′ (autumn: 0 (0.6×0), America Village: 0.36 (0.6×0.6), ..., Okumaza: 0, ...).
  • The weight values of the word groups GW1′, GW2′, GW3′, weighted in this way by the weight value (reference probability) of the one word “America Village” as shown in FIG. 23(b), are then summed word by word: the weight value of each word of the word group GW″ shown in FIG. 23(c) is the sum taken over the word groups GW1′, GW2′, GW3′ shown in FIG. 23(b).
  • Finally, the CPU 11 of the sentence unit search device 1 squares the weight value of each word, calculates the square root of the sum of the squared values, and divides the weight value of each word by that square root, so that the weight values of the weighted word group GW″ are normalized.
  • the weighted word group GW '' integrated by weighting and summing is a multidimensional vector with each word as one dimension and the weight value of each word as an element in each dimension.
  • The multidimensional vector may therefore be normalized by dividing each weight value (element) by the norm of the multidimensional vector. The norm here is not necessarily the Euclidean norm.
  • The weighted word group obtained by summing and normalizing in this way is created by the CPU 11 of the sentence unit search device 1 as the related word group of “America Village”.
  • The example shown below is an example of the related word group of the word “America Village”, with each word listed in descending order of weight value.
  • Each weight value of the related word group created for a word wi represents the degree of relevance from the word wi to each of the words w1, ..., wN, and the related word group can therefore be written as a relevance vector:
  • bwi = (w1 : bi,1, w2 : bi,2, ..., wN : bi,N)
  • The CPU 11 of the sentence unit search device 1 repeats the above-described process for all the words shown in the explanatory diagram of FIG. 6 to create a related word group for each word, and stores them in the document storage means 2 or in the storage means 13 of the sentence unit search device 1. In this way, by creating a related word group in which the degree of relevance is quantitatively calculated for each word appearing in the document set, it becomes possible to reflect the influence of related words on the weighted word group that represents the group of meanings of each sentence unit.
  • Next, the degree of relevance of each word in the created related word groups is reflected in the weighted word group stored for each sentence unit, that is, in the set of words and reference probabilities of the words, or the manifestation vector.
  • Specifically, the sentence unit search device 1 reads the reference probability of each word that has already been calculated and stored, recalculates each word's reference probability into a weight value that takes the degree of relevance into account, and stores the result.
  • FIG. 24 is a flowchart showing the processing procedure in which the CPU 11 of the sentence unit search apparatus 1 in Embodiment 3 recalculates the weight value of each word in the weighted word group stored in association with each sentence unit. The process shown in the flowchart of FIG. 24 corresponds to the process of reassigning, using the degree of relevance, the weight value of each word of the weighted word group associated with each sentence unit.
  • the CPU 11 of the sentence unit search device 1 acquires tagged document data from the document storage unit 2 via the document set connection unit 16 (step S71).
  • The CPU 11 identifies the tag <su> added to the acquired document data by character string analysis, and reads out the sentence unit (step S72).
  • The CPU 11 reads the salience attribute stored in the <su> tag (step S73), and recalculates each reference probability of the word/reference-probability pairs (weighted word group) stored in the salience attribute into a weight value that takes the association into account, using the related word groups (step S74).
  • The CPU 11 re-stores the weighted word group (manifestation vector) consisting of each word and the weight value recalculated in step S74, adding it as the salience attribute (step S75).
  • The CPU 11 then determines whether the sentence unit read in step S72 is the end of the document data (step S76). Whether the current sentence is the end of the acquired document data can be determined by whether another <su> tag follows the <su> </su> pair that encloses the current sentence; if none follows, the sentence can be determined to be the end. If the CPU 11 determines that the sentence is not the end of the document data (S76: NO), it returns the process to step S72 and continues the processing for the next sentence unit. On the other hand, if the CPU 11 determines that the sentence is the end of the document data (S76: YES), it determines whether the processing of recalculating the weight value of each word of the weighted word group and storing it in association with the salience attribute has been completed for all the document data (step S77).
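The loop above over <su>-tagged sentence units can be sketched as follows. The "word:probability" serialization of the salience attribute and the helper names are assumptions made for this sketch; the patent does not fix a concrete encoding, and real word strings containing ":" or ";" would need escaping.

```python
import re

def parse_salience(attr):
    # "word:prob;word:prob" -> {word: prob}; illustrative encoding only
    return {w: float(p) for w, p in (pair.split(":") for pair in attr.split(";") if pair)}

def format_salience(weights):
    return ";".join(f"{w}:{p:.4f}" for w, p in weights.items())

def recalculate_document(doc, recalc):
    """Apply `recalc` (weighted word group -> weighted word group) to every
    <su salience="..."> element of the tagged document data, as in Fig. 24."""
    def repl(match):
        weights = recalc(parse_salience(match.group(1)))
        return f'<su salience="{format_salience(weights)}">'
    return re.sub(r'<su salience="([^"]*)">', repl, doc)

doc = '<su salience="America Village:0.4">アメリカ村に行った。</su>'
# A stand-in recalculation (doubling) just to show the read-update-restore loop
updated = recalculate_document(doc, lambda ws: {w: p * 2 for w, p in ws.items()})
```

The sentence text between the tags is left untouched; only the salience attribute is rewritten, mirroring steps S73 through S75.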
  • the CPU 11 of the sentence unit retrieval apparatus 1 realizes the recalculation of the weight value of each word in step S74 by performing the following processing.
  • FIG. 25 is a flowchart showing the details of the processing procedure in which the CPU 11 of the sentence unit search device 1 in Embodiment 3 recalculates the weight value of each word in the weighted word group stored in association with each sentence unit. The process shown in the flowchart of FIG. 25 corresponds to the process of multiplying the weight values of the weighted word group by the degree of relevance of each word, and the process of reassigning the weight value of each word based on the multiplied weight values.
  • The CPU 11 of the sentence unit search device 1 reads each word of the weighted word group stored in association with the salience attribute read out in step S74 of the flowchart of FIG. 24, together with the reference probability of each word, and stores them in the temporary storage area 14 (step S81). The CPU 11 selects one of the words (step S82), and performs the following processing for the weight value of the selected word.
  • The CPU 11 reads the related word groups, to which the degrees of relevance are assigned, of each word stored in the storage means 13 or the document storage means 2 (step S83).
  • The CPU 11 acquires, from the read related word group of each word, the degree of relevance from each word to the selected word (step S84).
  • The CPU 11 multiplies the acquired degree of relevance from each word to the selected word by the reference probability of each word stored in the temporary storage area 14, and calculates the sum (step S85).
  • The CPU 11 determines whether the weight value has been recalculated for all the words stored in the temporary storage area 14 in step S81 (step S86). If the CPU 11 determines that the weight value has not been recalculated for all the words (S86: NO), it returns the process to step S82, moves to the next word, and executes the processing of steps S82 through S85 for recalculating the weight value. If the CPU 11 determines that the weight value has been recalculated for every word (S86: YES), it returns the process to step S75 of the flowchart of FIG. 24.
  • The processing for recalculating the weight values by the CPU 11 of the sentence unit search device 1, shown in step S74 of the flowchart of FIG. 24 and in the flowchart of FIG. 25, may also be executed in the first embodiment as part of the process of calculating the reference probabilities and storing them as weight values representing the manifestation of each sentence unit. For example, the configuration may be such that the processing shown in step S74 and the flowchart of FIG. 25 is executed between the processing of step S306 and step S307 in the processing procedure shown in the flowchart of FIG.
  • In this way, the CPU 11 of the sentence unit search device 1 recalculates the reference probability calculated for each word into a weight value that reflects the association.
  • For example, the sentence unit search device 1 calculates the weight value representing the manifestation of “Osaka” in a sentence unit as follows. It is assumed that, in the related word group created for “America Village”, the degree of relevance to “Osaka” is 0.3; that the words stored in association with the sentence unit include “America Village” with a reference probability of 0.4; and that “Osaka” itself is not included.
  • The CPU 11 of the sentence unit search device 1 multiplies the reference probability 0.4 of “America Village” by the degree of relevance 0.3 from “America Village” to “Osaka”, so that the weight value of “Osaka” in the sentence unit is recalculated to “0.12” instead of “0”.
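The "Osaka" example above can be checked numerically. This is a hedged sketch, not the device's implementation: the function name and the convention that a word's relevance to itself defaults to 1 are assumptions; only the figures 0.4, 0.3, and 0.12 come from the text.

```python
def recalc_weight(word, ref_probs, related_groups):
    """Recalculated weight of `word` in a sentence unit (steps S84-S85):
    sum over every stored word v of v's reference probability times the
    degree of relevance from v to `word`."""
    return sum(
        p * related_groups.get(v, {}).get(word, 1.0 if v == word else 0.0)
        for v, p in ref_probs.items()
    )

# "Osaka" does not appear in the sentence unit; only "America Village" does
ref_probs = {"America Village": 0.4}
# Related word group of "America Village", with relevance 0.3 toward "Osaka"
related_groups = {"America Village": {"America Village": 1.0, "Osaka": 0.3}}

osaka = recalc_weight("Osaka", ref_probs, related_groups)  # 0.4 * 0.3 = 0.12
```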
  • Let Vsalience(wk | pre(sj)) denote the weight value representing, with contextual association taken into account, the manifestation of the word wk in each sentence unit sj. The sentence unit search apparatus 1 recalculates the weight value of each word as shown in the following Expression (6):
  • Vsalience(wk | pre(sj)) = Σi bk,i · salience(wi | pre(sj)) ... (6)
  • The expression in the last line of Expression (7) below treats the weighted word group, that is, the pairs of words and word reference probabilities, as the manifestation vector v(sj), and expresses the manifestation vector V(sj) after resolving the association, whose k-th element is Vsalience(wk | pre(sj)); it thereby represents the principle of calculating the weight value of each word:
  • V(sj) = B · v(sj) ... (7)
  • Here, each of bw1, ..., bwN is the relevance vector of the related word group for the words w1, ..., wN, and the vector V(sj) is the manifestation vector in the oblique coordinate system based on the relevance vectors bw1, ..., bwN.
  • The manifestation vector V(sj) taking the association into account can therefore be interpreted as the manifestation vector v(sj), whose elements are the reference probabilities as they are, rotated in the directions of the related word axes.
  • The oblique coordinate system based on the relevance vectors bw1, ..., bwN is a coordinate system in which each base vector (the vector of size 1 in the direction of each word dimension) reflects the association, so that the angle between the base vectors of words that are highly related to each other is small. When the transformation matrix B, whose (j, k) element is bj,k, is multiplied by the manifestation vector whose elements are the reference probabilities, the manifestation vector taking the association into account is obtained.
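The matrix interpretation above can be sketched as a plain matrix-vector product. The three-word vocabulary and the relevance values are illustrative assumptions; only the structure, multiplying the manifestation vector of reference probabilities by the matrix of degrees of relevance, follows the text.

```python
vocab = ["America Village", "Osaka", "Tokyo"]
# B[j][k] = degree of relevance b_{j,k} from word w_k to word w_j
# (illustrative values; diagonal 1.0 = each word's relevance to itself)
B = [
    [1.0, 0.2, 0.0],   # toward "America Village"
    [0.3, 1.0, 0.1],   # toward "Osaka"
    [0.0, 0.1, 1.0],   # toward "Tokyo"
]
# v(s): only "America Village" appears, with reference probability 0.4
v = [0.4, 0.0, 0.0]

# Expression (7): V(s) = B . v(s)
V = [sum(B[j][k] * v[k] for k in range(len(vocab))) for j in range(len(vocab))]
# V = [0.4, 0.12, 0.0]: the "Osaka" dimension is excited via the association
```

The second element reproduces the 0.4 × 0.3 = 0.12 example: the "Osaka" axis acquires weight even though "Osaka" never appeared in the sentence unit.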
  • FIG. 26 is an explanatory diagram showing an example of the contents of weight values representing the manifestation of each word calculated by the CPU 11 of the sentence unit search apparatus 1 according to the third embodiment.
  • The weight values of each word for the sentence units s shown in FIG. 26(a) are the values before the association is taken into account using the related word groups, and those shown in FIG. 26(b) are the values after the association is taken into account.
  • The specific example shown in FIG. 26 is an example of sentence units extracted from a Japanese spoken language corpus (http://www.kokken.go.jp/katsudo/kenkyujyo/corpus, CSJ/vol17/D03F0040).
  • Next, the CPU 11 of the sentence unit search apparatus 1 adds the association with related words to the set of words and word reference probabilities, or manifestation vector, that is, the weighted word group that quantitatively represents the meaning of the accepted words. Below, the processing is described in which the CPU 11 of the sentence unit search device 1 recalculates, taking the association into account, the weight value of each word of the weighted word group associated with the accepted words, and executes a search based on the recalculated weight values.
  • FIG. 27 is a flowchart showing a processing procedure of search processing of the sentence unit search device 1 and the reception device 4 in the third embodiment.
  • the same reference numerals are used for the same steps as the processing procedure of the search processing shown in the flowcharts of FIGS. 15, 16 and 17 in the first embodiment. Detailed description is omitted.
  • The processing in step S4001 surrounded by the two-dot chain line differs from the processing procedures shown in the flowcharts of FIGS. 15, 16, and 17 in the first embodiment; that is, the difference is that step S4001 described below is added between step S411 and step S412.
  • The CPU 11 narrows down the words to those stored in the temporary storage area 14 with reference probabilities equal to or higher than a predetermined value (step S411), and recalculates the reference probabilities calculated in step S408 into weight values that reflect the association (step S4001).
  • In step S4001, the CPU 11 recalculates the weight values reflecting the association in the same manner as in the process shown in the flowchart of FIG. 25: it selects one word at a time and calculates, for the selected word, the sum of the degree of relevance from each word multiplied by the reference probability of each word.
  • The CPU 11 then reads, for the weighted word group with association added that was obtained in step S4001, the weighted word groups with association added that are stored in association with each sentence unit, and executes the process of extracting similar sentence units. Since the subsequent processing for the weighted word groups with association added is the same as in the first embodiment, a detailed description thereof is omitted.
  • As described above, the sentence unit search apparatus 1 can directly output sentence units, separated from the document data stored in the document storage means 2, that are judged to be similar as groups of meaning to the accepted words, with the association with related words taken into account. Therefore, by executing the sentence unit search method of the present invention, sentence units whose contextual meanings are similar can be effectively extracted, with associations taken into account, and output directly.
  • When the CPU 11 of the sentence unit search device 1 associates a weighted word group with the accepted words and determines whether it is similar to the weighted word groups stored in advance for each sentence unit, the determination is not necessarily based on whether the weighted word groups include the same words, as in the processing procedure shown in the flowchart of FIG. 27; nor is it necessarily made by calculating the difference between the weight values assigned to the same word and judging the similarity to be higher the smaller the calculated difference.
  • The case will be described below in which the CPU 11 of the sentence unit search apparatus 1 realizes the extraction of sentence units whose meanings are similar to the accepted words by expressing the meanings as manifestation vectors, with the relevance vectors taken into account, and calculating the distance between them.
  • FIG. 28 is a flowchart showing the search processing procedure of the sentence unit search device 1 and the reception device 4 when the vector representation in the third embodiment is used. Note that the processing procedure shown in the flowchart of FIG. 28 is the same as the processing procedure of the search processing shown in the flowcharts of FIGS. 15, 16, and 17 in the first embodiment and the flowchart of FIG. 19 in the second embodiment. The same reference numerals are used for the respective steps, and detailed description thereof is omitted.
  • The steps surrounded by the alternate long and short dash line, from step S501 up to step S506, are processed as before; the processing in step S5001 surrounded by the two-dot chain line differs from the processing procedure shown in the flowchart of FIG. 19 in the second embodiment. That is, the difference is that step S5001 described below is added between step S501 and step S502.
  • the CPU 11 of the sentence unit search device 1 recalculates the manifestation vector calculated in step S501 into an manifestation vector reflecting the association of related words (step S5001).
  • The CPU 11 then reads, for the manifestation vector with association taken into account that was obtained in step S5001, the manifestation vectors with association taken into account that are stored in association with each sentence unit, and executes the process of extracting similar sentence units. Since the process of reading the manifestation vectors with association added and extracting similar sentence units is the same as in the second embodiment, a detailed description thereof is omitted.
  • In the process of step S5001, the CPU 11 recalculates the manifestation vector into a manifestation vector taking the association with related words into account by transforming (rotating) the manifestation vector calculated in step S501 with the relevance vector group (matrix), calculating it as shown in Expression (7). Specifically, the manifestation vector V(s) is calculated by adding the above association to the multidimensional vector v(s) whose elements are only the reference probabilities.
  • The calculation in step S503 of the distance between the manifestation vector associated with the words accepted by the CPU 11 and the read manifestation vector is specifically performed as follows in the third embodiment. The manifestation vector recalculated with the association for the accepted words u is represented as V(u), and the read manifestation vector to which the association was added in advance is represented as V(s).
  • the CPU 11 calculates the cosine distance as shown in the following equation (8).
  • In step S506, the CPU 11 assigns similarities in descending order of the calculated cosine distance.
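The cosine-distance ranking of Expression (8) and step S506 can be sketched as follows. The vectors and candidate sentence units are illustrative assumptions; the point of the sketch is that a sentence unit sharing only a related-word dimension with the query still receives a positive similarity.

```python
from math import sqrt

def cosine(a, b):
    # Cosine of the angle between two vectors, as in Expression (8)
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# V(u): association-resolved manifestation vector of the accepted words,
# over an assumed vocabulary ["America Village", "Osaka", "Tokyo"]
V_u = [0.4, 0.12, 0.0]

# V(s) for two hypothetical sentence units: s1 mentions only "Osaka",
# s2 mentions only "Tokyo"
candidates = {"s1": [0.0, 0.5, 0.0], "s2": [0.0, 0.0, 0.9]}

# step S506: rank sentence units in descending order of cosine similarity
ranked = sorted(candidates, key=lambda s: cosine(V_u, candidates[s]), reverse=True)
```

Sentence unit s1 ranks first even though it shares no surface word with the query, because the association excited the "Osaka" dimension of V(u).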
  • In the third embodiment, the manifestation vectors associated with each sentence unit and with the accepted words are handled in an oblique coordinate system in which the dimensions corresponding to the words are not orthogonal and the angle between the dimension directions of highly related words is small. For this reason, when the distances between vectors are compared to determine whether they are similar, vectors that have elements in the dimension directions of highly related words come to be judged similar.
  • For example, when the manifestation of “Osaka” in the accepted words is low, the sentence unit s is not judged to be similar to the accepted words. However, when the manifestation of “America Village” in the accepted words is high, the manifestation of “Osaka” is excited and increased, so the possibility that the sentence unit s is judged to be similar to the accepted words increases.
  • In each of the embodiments described above, the text data received as the search result is displayed on the monitor of the display means 46 provided in the reception device 4; however, a configuration may also be adopted in which the received text data is converted into a voice signal and output via the speaker of the audio input/output means 47.
  • Thereby, the user can obtain, as search results, sentence units whose context and meaning are similar, based on the plurality of words he or she has input or on a conversation with another user. Moreover, since the accepted words are also spoken language, sentence units whose word manifestations are similar can be obtained directly as search results, including words that are omitted in utterances and represented by zero pronouns.
  • In Embodiments 1 to 3, the sentence unit search apparatus 1 specifies and stores information indicating the manifestation for each sentence unit; however, a configuration may also be adopted in which paragraphs are enclosed by the tags <p> </p>, a feature pattern is specified for each paragraph, information indicating the manifestation is stored by the salience attribute, and the paragraph is output as a search result. The unit is not limited to a sentence or a paragraph; it may be a phrase, as long as it is a unit that represents a certain group of meanings. In the case of spoken language, the character string that can be identified as one sentence can be very long.
  • In addition, document data composed of spoken language may be stored in advance separately from document data composed of written language, and a configuration may be adopted in which the document storage means 2 stores the probability every time a word feature pattern is specified and a reference probability is calculated.
  • When the CPU 11 of the sentence unit search device 1 determines whether or not consecutively received words form a series of words, it may use information identifying the accepting device 4 that is the transmission source of the words, or information indicating that the accepting device 4 has detected a user's search start or end operation.
  • words can be stored in the document storage unit 2 in units corresponding to pages of document data stored in the document storage unit 2 in advance.
  • In Embodiments 1 to 3, the sentence unit search device 1 performs all of the processing for acquiring and tagging document data, the regression analysis for obtaining the reference probabilities, and the processing performed when words are received; however, these may be divided between a sentence unit search device and a document storage device.
  • In that case, the document data is acquired by Web crawling in the document storage device, tags are further added to the text data by morphological analysis and syntactic analysis, and the data is stored. An equation for calculating the reference probability is obtained by regression analysis based on the document data stored in the document storage device, and the process of storing, for each sentence unit, the words and the reference probabilities of the words using the obtained equation is performed in advance.
  • The sentence unit search device then specifies the feature patterns when text data converted from words is received, acquires from the document storage device the regression formula for calculating the reference probabilities, calculates the reference probabilities, and performs the search.
  • In Embodiments 1 to 3, the configuration is such that the input of words, for example a character string input or a voice input from the user, is converted into text data by the reception device 4 and transmitted to the sentence unit search device 1. However, the sentence unit search device 1 may itself be configured to include operation means that receives the user's character string input operations and voice input means that receives the user's voice input.
  • FIG. 29 is a block diagram showing the configuration in the case where the sentence unit search method of the present invention is implemented by the sentence unit search apparatus 1 alone.
  • In this case, the sentence unit search device 1 includes, in addition to the CPU 11, the internal bus 12, the storage means 13, the temporary storage area 14, the document set connection means 16, and the auxiliary storage means 17, operation means 145 such as a mouse or keyboard that accepts user operations, display means 146 such as a monitor, and voice input/output means 147 such as a microphone and a speaker.
  • In this configuration, the CPU 11 of the sentence unit search device 1 can detect the frequency or the conversation speed indicating the characteristics of the speech input from the voice input/output means 147 and thereby specify the feature pattern of each word in the utterance. The grammatical feature pattern of each word can be specified by converting the speech into text data by speech recognition, and the search can then be performed based on the text data.
  • In Embodiments 1 to 3, the accepting devices 4, 4, ... were configured as devices that only receive character strings or spoken words, separate them into units of a certain length, convert them into digital data, and transmit them. However, the accepting devices 4, 4, ... may also be configured so that the CPU 41, by reading the program stored in the storage means 43, performs natural language analysis such as morphological analysis and syntactic analysis, or phoneme analysis, on the accepted words.
  • Further, the CPU 41 of the accepting devices 4, 4, ... may calculate the weight value representing the manifestation of each word in the accepted words, and transmit the calculated weighted word group to the sentence unit search device 1 as a search request.
  • The sentence unit search method according to the present invention can be applied in combination with speech recognition of conversations between users. For example, the present invention can be applied to applications in which a computer apparatus participates in a conversation between users and carries on the conversation. It can also be applied to applications that provide a conversation-linked advertisement presentation service in which advertisements are switched according to the flow of the conversation or chat context between users, and to conference support services that present similar and related minutes from past minutes according to the flow of context during a meeting. Furthermore, it can also be applied to writing support services that accept written text as words and provide related information according to the flow of the context.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present invention concerns a computer-executed sentence search method that sorts in advance the document data of a document set into sentence units. Information representing the cohesion of meaning reflecting the flow of context from the preceding sentences to a given sentence, namely a weighted word group in which a weight value is given to each word of the sentence, is associated with each sentence, and the sentences and the associated weighted word groups are stored. When the computer receives words, it acquires and associates with them information representing the cohesion of meaning in the flow of the spoken conversation, namely a weighted word group in which a weight value is given to each word; it then extracts sentences similar in cohesion of meaning on the basis of the weighted word group associated with the words, and outputs the sentences as search results. The weight value given to each word may be a value reflecting the influence, as a function of the weight value of a related word in the sentence and the degree of relation between the related word and each word.
PCT/JP2007/055448 2006-08-21 2007-03-16 Procédé de recherche de phrase, moteur de recherche de phrase, programme informatique, support d'enregistrement et stockage de document WO2008023470A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2008530812A JP5167546B2 (ja) 2006-08-21 2007-03-16 文単位検索方法、文単位検索装置、コンピュータプログラム、記録媒体及び文書記憶装置

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2006224563 2006-08-21
JP2006-224563 2006-08-21

Publications (1)

Publication Number Publication Date
WO2008023470A1 true WO2008023470A1 (fr) 2008-02-28

Family

ID=39106564

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2007/055448 WO2008023470A1 (fr) 2006-08-21 2007-03-16 Procédé de recherche de phrase, moteur de recherche de phrase, programme informatique, support d'enregistrement et stockage de document

Country Status (2)

Country Link
JP (1) JP5167546B2 (fr)
WO (1) WO2008023470A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287291B (zh) * 2019-07-03 2021-11-02 桂林电子科技大学 一种无监督的英语短文句子跑题分析方法

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06162092A (ja) * 1992-11-18 1994-06-10 Fujitsu Ltd 情報検索装置
JP2004234175A (ja) * 2003-01-29 2004-08-19 Matsushita Electric Ind Co Ltd コンテンツ検索装置およびそのプログラム
JP2005250762A (ja) * 2004-03-03 2005-09-15 Mitsubishi Electric Corp 辞書生成装置、辞書生成方法および辞書生成プログラム

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TOKUNAGA, T.: "Gengo to Keisan 5: Joho Kensaku to Gengo Shori" (Language and Computation 5: Information Retrieval and Language Processing), 1st ed., University of Tokyo Press, 1999, XP003021201 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009282936A (ja) * 2008-05-26 2009-12-03 Nippon Telegr & Teleph Corp <Ntt> 選択式情報提示装置および選択式情報提示処理プログラム
JP2015506509A (ja) * 2011-12-28 2015-03-02 ▲騰▼▲訊▼科技(深▲セン▼)有限公司 評価情報を生成するための方法およびシステム、ならびにコンピュータ記憶媒体
JP2013140500A (ja) * 2012-01-05 2013-07-18 Nippon Telegr & Teleph Corp <Ntt> 単語抽出装置及び方法及びプログラム
JP2013140499A (ja) * 2012-01-05 2013-07-18 Nippon Telegr & Teleph Corp <Ntt> 単語抽出方法及び装置及びプログラム
US10614065B2 (en) 2016-10-26 2020-04-07 Toyota Mapmaster Incorporated Controlling search execution time for voice input facility searching
CN108710613A (zh) * 2018-05-22 2018-10-26 平安科技(深圳)有限公司 文本相似度的获取方法、终端设备及介质
JP2020042771A (ja) * 2018-09-07 2020-03-19 台達電子工業股▲ふん▼有限公司Delta Electronics,Inc. データ分析方法及びデータ分析システム
US11409804B2 (en) 2018-09-07 2022-08-09 Delta Electronics, Inc. Data analysis method and data analysis system thereof for searching learning sections
JP2020057105A (ja) * 2018-09-28 2020-04-09 株式会社リコー 言語処理方法、言語処理プログラム及び言語処理装置
JP7147439B2 (ja) 2018-09-28 2022-10-05 株式会社リコー 言語処理方法、言語処理プログラム及び言語処理装置
CN111489743A (zh) * 2019-01-28 2020-08-04 国家电网有限公司客户服务中心 一种基于智能语音技术的运营管理分析系统
US11397776B2 (en) 2019-01-31 2022-07-26 At&T Intellectual Property I, L.P. Systems and methods for automated information retrieval
JP2020149369A (ja) * 2019-03-13 2020-09-17 株式会社東芝 対話制御システム、対話制御方法及びプログラム
JP7055764B2 (ja) 2019-03-13 2022-04-18 株式会社東芝 対話制御システム、対話制御方法及びプログラム
CN110083681A (zh) * 2019-04-12 2019-08-02 中国平安财产保险股份有限公司 基于数据分析的搜索方法、装置及终端
CN110083681B (zh) * 2019-04-12 2024-02-09 中国平安财产保险股份有限公司 基于数据分析的搜索方法、装置及终端
CN111753498A (zh) * 2020-08-10 2020-10-09 腾讯科技(深圳)有限公司 文本处理方法、装置、设备及存储介质
CN111753498B (zh) * 2020-08-10 2024-01-26 腾讯科技(深圳)有限公司 文本处理方法、装置、设备及存储介质
CN112784577A (zh) * 2021-01-26 2021-05-11 鲁巧巧 一种英语教学用语句关联学习系统
CN112784577B (zh) * 2021-01-26 2022-11-18 鲁巧巧 一种英语教学用语句关联学习系统
CN113761157A (zh) * 2021-05-28 2021-12-07 腾讯科技(深圳)有限公司 应答语句生成方法和装置
CN113761157B (zh) * 2021-05-28 2024-05-24 腾讯科技(深圳)有限公司 应答语句生成方法和装置
CN117851614A (zh) * 2024-03-04 2024-04-09 创意信息技术股份有限公司 一种用于海量数据的搜索方法、装置、系统及存储介质
CN117851614B (zh) * 2024-03-04 2024-05-14 创意信息技术股份有限公司 一种用于海量数据的搜索方法、装置、系统及存储介质

Also Published As

Publication number Publication date
JPWO2008023470A1 (ja) 2010-01-07
JP5167546B2 (ja) 2013-03-21

Similar Documents

Publication Publication Date Title
JP5167546B2 (ja) Sentence unit search method, sentence unit search device, computer program, recording medium, and document storage device
US9330661B2 (en) Accuracy improvement of spoken queries transcription using co-occurrence information
KR101279707B1 (ko) 문서에서 정의를 식별하는 방법 및 정의 추출 시스템
US20040148170A1 (en) Statistical classifiers for spoken language understanding and command/control scenarios
US20040073874A1 (en) Device for retrieving data from a knowledge-based text
US20040148154A1 (en) System for using statistical classifiers for spoken language understanding
US20070198511A1 (en) Method, medium, and system retrieving a media file based on extracted partial keyword
JP2004005600A (ja) Method and system for indexing and retrieving documents stored in a database
EP2348427B1 (fr) Apparatus and method for speech retrieval
Favre et al. Robust named entity extraction from large spoken archives
AU2006317628A1 (en) Word recognition using ontologies
EP1016074A1 (fr) Text normalization using a context-free grammar
JP2004133880A (ja) Method for constructing a dynamic vocabulary for a speech recognizer used with a database of indexed documents
US20230069935A1 (en) Dialog system answering method based on sentence paraphrase recognition
Sen et al. Bangla natural language processing: A comprehensive analysis of classical, machine learning, and deep learning-based methods
JP2006244262A (ja) Question answering search system, method, and program
EP1331574B1 (fr) A named entity interface for multiple client application programs
CN115759071A (zh) System and method for identifying government-affairs sensitive information based on big data
Lin et al. Enhanced BERT-based ranking models for spoken document retrieval
CN110020024B (zh) Classification method, system, and device for linked resources in scientific literature
Rosset et al. The LIMSI participation in the QAst track
Sen et al. Bangla natural language processing: A comprehensive review of classical machine learning and deep learning based methods
Sen et al. Audio indexing
Safarik et al. Unified approach to development of ASR systems for East Slavic languages
Aliero et al. Systematic review on text normalization techniques and its approach to non-standard words

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 07738893

Country of ref document: EP

Kind code of ref document: A1

WWE WIPO information: entry into national phase

Ref document number: 2008530812

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

122 EP: PCT application non-entry into the European phase

Ref document number: 07738893

Country of ref document: EP

Kind code of ref document: A1