JP2006065387A - Text sentence search device, method, and program

Info

Publication number: JP2006065387A
Application number: JP2004243739A
Authority: JP
Inventors: Koju O; 洪涛王; Mosho Son; 茂松孫; Tsuguaki Ryu; 紹明劉
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2004-08-24
Filing date: 2004-08-24
Publication date: 2006-03-09
Anticipated expiration: 2024-08-24
Also published as: JP4534666B2

Abstract

PROBLEM TO BE SOLVED: To provide a text sentence search device, a text sentence search method, and a text sentence search program for searching for a text sentence with high precision while taking the semantic content of the text sentence into consideration.

SOLUTION: A word sectioning/word class giving part 14 sections a query text sentence into words and gives each of them a word class. A syntax/meaning analysis part 16 analyzes the semantic chunks, center words, and cases of the query text sentence based on the word sectioning/word class giving data. A feature quantity generation part 18 sets a semantic chunk weight and a center word weight and generates the semantic vector feature quantity of the query text sentence by using these weights, the word sectioning/word class giving data, the DF values of the words in the query text sentence among the DF values of words stored in a statistic data storage part 20, and the like. A similarity calculation part 22 calculates the similarity between the semantic vector feature quantity of the query text sentence and that of a search object text sentence stored in a database storage part 24, and stores the result in a storage part 26.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a text sentence search device, a text sentence search method, and a text sentence search program that search for text sentences using, for example, a computer, and more particularly to a text sentence search device, method, and program that take the semantic content of text sentences into consideration.

With the development of information processing technology and the Internet, information resources of all kinds are growing faster than expected. How to extract the information a user needs from this flood of information and provide it has become a major issue.

In addition, example-based machine translation requires retrieving, at high speed and with high accuracy, the example sentence in an example database that is most similar to the text sentence to be translated. If an example sentence closely resembling the text sentence to be translated is found, a translation of that text sentence can easily be generated from the translation of the example sentence. Search accuracy therefore affects translation accuracy, and high-accuracy, high-speed search is one of the key technologies for machine translation.

The main computer-based information retrieval methods have included (1) search using search keywords and (2) the vector space method (Vector Space Model: VSM), which uses statistical techniques. In the vector space method, a text document is represented by a vector feature quantity, for example with TF*IDF (Term Frequency * Inverse Document Frequency) values as the vector components, and the similarity between texts is obtained from the similarity between their vector feature quantities.

However, the vector space method is based on statistical theory and is not a search method based on the semantic content of text. Its search accuracy is therefore limited, and it cannot meet the needs of example-based machine translation.

Because the semantic information of a text sentence represents its content and is not affected by syntactic structure or sentence form, information comparison and retrieval using semantic information have been studied. For example, Non-Patent Document 1 describes a method that represents the semantic relations of a text with case grammar and compares semantic relations by tree matching. However, the search accuracy of this method is lower than that of the vector space method.

Non-Patent Document 2 describes a method that combines partial relation matching with the vector space method. However, although this method achieves better search coverage than the vector space method, its search accuracy is low.

Non-Patent Document 3 describes a search method that uses causal relations in text sentences. However, this method is essentially an improved keyword search and cannot provide the high-accuracy text sentence search required for example-based machine translation.

Patent Document 1 describes a technique that treats multiple words in a synonym relationship as a single word, creates vector data (updated vector data) containing the appearance frequency of that word, and uses this vector data to calculate the similarity between a search key document and a search target document. According to this technique, highly reliable similar-document search can be achieved even when words with the same meaning but different notations are mixed within one document, or when the documents being compared contain words with the same meaning but different notations.

Patent Document 2 describes a technique that segments an input character string into words and attaches morpheme information, analyzes the dependency relations between phrases based on the morpheme information, determines the sentence structure from the analysis result, extracts index terms from the sentence structure and assigns each an importance, and judges the similarity between the input document and stored documents from the similarity of the index terms and the similarity of the dependency relations.

Patent Document 3 describes a technique that performs Japanese-language analysis (syntactic, semantic, and contextual analysis) of an input Japanese document using a knowledge base, represents the document as a deep structure in a fixed, normalized propositional form, semantically matches the deep structure of a question sentence against themes, then semantically matches it against the deep structures under each matching theme and its related themes, and outputs documents whose deep structure matches.

Non-Patent Document 1: Lu, X. (1990). "An application of case relations to document retrieval" (Doctoral dissertation, University of Western Ontario, 1990). Dissertation Abstracts International, 52-10, 3464A.
Non-Patent Document 2: Liu, G. Z. (1997). "Semantic vector space model: Implementation and evaluation". Journal of the American Society for Information Science, 48(5), 395-417.
Non-Patent Document 3: Khoo, Christopher Soo-Guan (1997). "The use of relation matching in information retrieval". Electronic Journal ISSN 1058-6768.
Patent Document 1: Japanese Patent Application Laid-Open No. H11-110395
Patent Document 2: Japanese Patent Application Laid-Open No. H3-172966
Patent Document 3: Japanese Patent Application Laid-Open No. H1-21624

However, the techniques described in Patent Documents 1 to 3 are sometimes unable to search for text sentences with high accuracy in consideration of their semantic content.

The present invention has been made in view of the above, and its object is to obtain a text sentence search device, a text sentence search method, and a text sentence search program that can search for text sentences with high accuracy in consideration of the semantic content of the text sentences.

To achieve the above object, the text sentence search device of the invention according to claim 1 comprises: word segmentation / part-of-speech assignment means that segments a question text sentence into words, assigns a part of speech to each segmented word, and generates word segmentation / part-of-speech assignment data; syntax/semantic analysis means that analyzes, based on the word segmentation / part-of-speech assignment data, the semantic chunks contained in the question text sentence, the central word of each semantic chunk, and the case of each semantic chunk; appearance frequency data storage means that stores in advance appearance frequency data representing, for each word appearing in the set of text sentences to be searched, how frequently the word appears in that set; setting means that sets, based on the case of a semantic chunk, at least one of a semantic chunk weight and a central word weight for the words contained in that semantic chunk; feature quantity generation means that generates the semantic vector feature quantity of the question text sentence by calculating a weight for each word contained in the question text sentence based on the appearance frequency data for that word and on at least one of the semantic chunk weight and the central word weight; semantic vector feature quantity storage means that stores in advance the semantic vector feature quantity of each text sentence in the set of text sentences to be searched; and similarity calculation means that calculates the similarity between the question text sentence and a text sentence to be searched based on the semantic vector feature quantity of the question text sentence and that of the text sentence to be searched.

According to this invention, the word segmentation / part-of-speech assignment means segments the question text sentence into words and assigns a part of speech, such as noun, verb, or particle, to each segmented word.

The syntax/semantic analysis means analyzes the semantic chunks contained in the question text sentence, the central word of each semantic chunk, and the case of each semantic chunk, based on the word segmentation / part-of-speech assignment data generated by the word segmentation / part-of-speech assignment means. Here, a semantic chunk is a group of words that together form a single unit of meaning. A central word is a word that plays an important role in the semantic content of a semantic chunk; a semantic chunk contains one or more central words. A case denotes the semantic case of a word, for example the agent or the object of the predicate.

The appearance frequency data storage means stores in advance appearance frequency data calculated for each word that appears in the set of text sentences to be searched, that is, in the plurality of text sentences against which the question text sentence is compared. The appearance frequency data represents how frequently a word that appears in the set of text sentences to be searched appears in that set.

The setting means sets at least one of a semantic chunk weight and a central word weight for the words contained in a semantic chunk, based on the case of that semantic chunk as analyzed by the syntax/semantic analysis means. Specifically, for example, cases are classified in advance into a plurality of classes according to their importance, and the semantic chunk weight and central word weight are set according to the importance of the case.
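The class-based weight setting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the case names, the assignment of cases to classes, the numeric weights, and the names `Chunk`, `CASE_CLASS`, `CLASS_WEIGHTS`, and `set_weights` are all assumptions introduced here, since the text only states that cases are pre-classified by importance.

```python
from dataclasses import dataclass

# Hypothetical importance classes for semantic cases (assumed, not from the source).
CASE_CLASS = {"agent": 1, "object": 1, "time": 2, "place": 2}

# Hypothetical (semantic chunk weight, central word weight) for each class.
CLASS_WEIGHTS = {1: (1.0, 1.0), 2: (0.5, 0.5)}
DEFAULT_WEIGHTS = (0.0, 0.0)  # used for cases that were not classified


@dataclass
class Chunk:
    words: list[str]   # all words in the semantic chunk
    heads: set[str]    # central word(s) of the chunk
    case: str          # semantic case assigned by the analyzer


def set_weights(chunk: Chunk) -> dict[str, tuple[float, float]]:
    """Return {word: (chunk_weight, head_weight)} for one semantic chunk.

    Every word receives the chunk weight of its case class; only central
    words additionally receive the central word weight.
    """
    chunk_w, head_w = CLASS_WEIGHTS.get(CASE_CLASS.get(chunk.case), DEFAULT_WEIGHTS)
    return {w: (chunk_w, head_w if w in chunk.heads else 0.0) for w in chunk.words}
```

For example, a chunk with case `"object"`, words `["red", "car"]`, and central word `"car"` gives both words the class-1 chunk weight, while only `"car"` receives a non-zero central word weight.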

The feature quantity generation means reads the appearance frequency data of each word contained in the question text sentence from the appearance frequency data storage means, and calculates a weight for each word contained in the question text sentence based on the read appearance frequency data and on at least one of the semantic chunk weight and the central word weight set by the setting means for the words in each semantic chunk. The word weights calculated in this way make up the semantic vector feature quantity of the question text sentence.

The semantic vector feature quantity storage means stores in advance the semantic vector feature quantity of each text sentence in the set of text sentences to be searched. These semantic vector feature quantities can be generated in the same way as the semantic vector feature quantity of the question text sentence.

The similarity calculation means calculates the similarity between the question text sentence and a text sentence to be searched, based on the semantic vector feature quantity of the question text sentence generated by the feature quantity generation means and the semantic vector feature quantity of the text sentence to be searched stored in the semantic vector feature quantity storage means.

Thus, according to the present invention, the semantic vector feature quantity is generated taking into account the semantic chunk weight and central word weight set according to the importance of the case of each semantic chunk. The similarity calculation therefore properly reflects semantic content, and text sentences similar to the question text sentence can be retrieved with high accuracy.

As described in claim 2, the feature quantity generation means can calculate the weight of a word contained in the question text sentence by adding at least one of the semantic chunk weight and the central word weight to a first parameter based on the frequency with which the word appears in the question text sentence, and multiplying the resulting sum by a second parameter based on the frequency with which the word appears in the set of text sentences to be searched.
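A minimal sketch of this additive weighting, taking the first parameter to be the term count TFi and the second parameter to be log(N/ni + c), as in the expressions given later in the description. The function name `additive_weight` and its argument names are assumptions for illustration, and the logarithm argument is read with ordinary operator precedence as (N/ni) + c.

```python
import math


def additive_weight(tf, n_total, n_with_word, chunk_w=0.0, head_w=0.0, c=0.01):
    """(TF + Chunk_Weight + Head_Weight) * log(N / n_i + c).

    tf          -- times the word occurs in the sentence (first parameter)
    n_total     -- N, number of sentences in the search target set
    n_with_word -- n_i, number of those sentences containing the word
    chunk_w     -- semantic chunk weight (leave 0.0 to omit)
    head_w      -- central word weight (leave 0.0 if not a central word)
    """
    second = math.log(n_total / n_with_word + c)  # second parameter
    return (tf + chunk_w + head_w) * second
```

Leaving `chunk_w` or `head_w` at 0.0 gives the variants that use only one of the two weights.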

As described in claim 3, the feature quantity generation means can calculate the weight of a word contained in the question text sentence by multiplying a first parameter based on the frequency with which the word appears in the question text sentence by a second parameter based on the frequency with which the word appears in the set of text sentences to be searched, and multiplying the resulting product by a coefficient that includes at least one of the semantic chunk weight and the central word weight.
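A minimal sketch of this multiplicative weighting. The coefficient follows the expressions given later in the description: the average of the two weights when both are used, or the single weight when only one is used. The function name `multiplicative_weight`, the use of `None` to mark an unused weight, and the fallback coefficient of 1.0 when neither weight is given are assumptions made here.

```python
import math


def multiplicative_weight(tf, n_total, n_with_word, chunk_w=None, head_w=None, c=0.01):
    """TF * log(N / n_i + c) * coefficient.

    The coefficient is (Chunk_Weight + Head_Weight) / 2 when both weights
    are supplied, the single weight when only one is supplied, and 1.0
    (plain TF*IDF, an assumed fallback) when neither is supplied.
    """
    base = tf * math.log(n_total / n_with_word + c)
    weights = [w for w in (chunk_w, head_w) if w is not None]
    coeff = sum(weights) / len(weights) if weights else 1.0
    return base * coeff
```

Unlike the additive form, a weight of zero here removes the word from the vector entirely, so the weight values would need to be chosen with that in mind.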

As described in claim 4, the similarity calculation means can calculate the similarity by finding the words that occur both in the question text sentence and in the text sentence to be searched, multiplying the two weights of each such word together, and summing the products over all matching words.
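With each semantic vector feature quantity held as a sparse mapping from word to weight, this calculation is an inner product restricted to the shared words. The sketch below assumes that dictionary representation; the name `inner_product_similarity` is illustrative.

```python
def inner_product_similarity(query_vec, target_vec):
    """Sum of weight products over words present in both sentences.

    query_vec, target_vec -- dicts mapping word -> weight.
    """
    shared = query_vec.keys() & target_vec.keys()  # matching words only
    return sum(query_vec[w] * target_vec[w] for w in shared)
```

Words that appear in only one of the two sentences contribute nothing, which is what makes the sparse representation sufficient.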

As described in claim 5, the similarity calculation means can calculate the similarity by finding the words that occur both in the question text sentence and in the text sentence to be searched, calculating the distance between the two weights of each such word, and summing the calculated distances over all matching words.
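A minimal sketch of the distance-based calculation, assuming the per-word distance is the absolute difference between the two weights; the text does not specify the distance function, and it does not say how the summed value is turned into a ranking, so the sketch returns the raw sum. The name `summed_distance` is illustrative.

```python
def summed_distance(query_vec, target_vec):
    """Sum of |w_query - w_target| over words present in both sentences.

    query_vec, target_vec -- dicts mapping word -> weight.
    The absolute difference is an assumed choice of distance.
    """
    shared = query_vec.keys() & target_vec.keys()  # matching words only
    return sum(abs(query_vec[w] - target_vec[w]) for w in shared)
```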

The text sentence search method of the invention according to claim 6 comprises: segmenting a question text sentence into words, assigning a part of speech to each segmented word, and generating word segmentation / part-of-speech assignment data; analyzing, based on the word segmentation / part-of-speech assignment data, the semantic chunks contained in the question text sentence, the central word of each semantic chunk, and the case of each semantic chunk; setting, based on the case of a semantic chunk, at least one of a semantic chunk weight and a central word weight for the words contained in that semantic chunk; generating the semantic vector feature quantity of the question text sentence by calculating a weight for each word contained in the question text sentence based on the appearance frequency data for that word, taken from appearance frequency data representing how frequently each word appearing in the set of text sentences to be searched appears in that set, and on at least one of the semantic chunk weight and the central word weight; and calculating the similarity between the question text sentence and a text sentence to be searched based on the semantic vector feature quantity of the question text sentence and that of the text sentence to be searched.

According to this invention, the semantic vector feature quantity is generated taking into account the semantic chunk weight and central word weight set according to the importance of the case of each semantic chunk. The similarity calculation therefore properly reflects semantic content, and text sentences similar to the question text sentence can be retrieved with high accuracy.

The text sentence search program of the invention according to claim 7 causes a computer to execute processing comprising the steps of: segmenting a question text sentence into words, assigning a part of speech to each segmented word, and generating word segmentation / part-of-speech assignment data; analyzing, based on the word segmentation / part-of-speech assignment data, the semantic chunks contained in the question text sentence, the central word of each semantic chunk, and the case of each semantic chunk; setting, based on the case of a semantic chunk, at least one of a semantic chunk weight and a central word weight for the words contained in that semantic chunk; generating the semantic vector feature quantity of the question text sentence by calculating a weight for each word contained in the question text sentence based on the appearance frequency data for that word, taken from appearance frequency data representing how frequently each word appearing in the set of text sentences to be searched appears in that set, and on at least one of the semantic chunk weight and the central word weight; and calculating the similarity between the question text sentence and a text sentence to be searched based on the semantic vector feature quantity of the question text sentence and that of the text sentence to be searched.

According to this invention, the semantic vector feature quantity is generated taking into account the semantic chunk weight and central word weight set according to the importance of the case of each semantic chunk. A computer can therefore be realized that performs a similarity calculation properly reflecting semantic content and retrieves text sentences similar to the question text sentence with high accuracy.

The text sentence search device of the invention according to claim 8 comprises: a semantic analysis unit that performs linguistic semantic analysis of a question text sentence and sets the semantic importance of each word of the question text sentence; a feature quantity generation unit that generates the feature quantity of the question text sentence using the result of the linguistic semantic analysis and the vector space method; a similarity calculation unit that calculates the similarity between the feature quantity of the question text sentence generated by the feature quantity generation unit and the feature quantity of a search target text sentence; and a search result extraction unit that extracts the text sentences constituting the search result from the search target text sentence set based on the calculated similarity.

According to this invention, linguistic semantic analysis of the question text sentence is performed, the semantic importance of each word of the question text sentence is set, the feature quantity of the question text sentence is generated using the result of the linguistic semantic analysis and the vector space method, the similarity between the generated feature quantity and the feature quantity of a search target text sentence is calculated, and the text sentences constituting the search result are extracted from the search target text sentence set based on the calculated similarity. The similarity calculation therefore properly reflects semantic content, and text sentences similar to the question text sentence can be retrieved with high accuracy.

As described in claim 9, the semantic analysis unit can perform linguistic semantic analysis based on case grammar.

As described in claim 10, the similarity calculation unit can calculate the similarity using at least one of an inner product, a cosine function, and a Euclidean distance.
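The cosine and Euclidean measures named here can be sketched over sparse word-to-weight dictionaries as follows. The dictionary representation and the function names `cosine_similarity` and `euclidean_distance` are assumptions for illustration; returning 0.0 for an all-zero vector is also a choice made here.

```python
import math


def cosine_similarity(query_vec, target_vec):
    """Cosine of the angle between two sparse weight vectors (dicts)."""
    dot = sum(query_vec[w] * target_vec[w] for w in query_vec.keys() & target_vec.keys())
    norm_q = math.sqrt(sum(v * v for v in query_vec.values()))
    norm_t = math.sqrt(sum(v * v for v in target_vec.values()))
    # Assumed convention: an empty or all-zero vector is similar to nothing.
    return dot / (norm_q * norm_t) if norm_q and norm_t else 0.0


def euclidean_distance(query_vec, target_vec):
    """Euclidean distance; a word missing from one vector has weight 0."""
    words = query_vec.keys() | target_vec.keys()
    return math.sqrt(sum((query_vec.get(w, 0.0) - target_vec.get(w, 0.0)) ** 2 for w in words))
```

The cosine measure normalizes away sentence length, whereas the inner product and Euclidean distance do not; for the Euclidean distance a smaller value indicates a closer match.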

As described in claim 11, the feature quantity generation unit can calculate the feature quantity obtained using the vector space method by the following expression.

TFi × log(N/ni + c)

Here, TFi is the number of times word i appears in the text sentence, N is the total number of text sentences in the search target text sentence set, ni is the number of text sentences in the search target text sentence set that contain word i, and c is a constant set so that c ≥ 0.01.
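As a rough sketch of how this baseline feature quantity could be computed for one sentence, the code below first counts ni over a tokenized search target set and then applies the expression word by word. The function names, the list-of-word-lists input format, and the decision to skip words that never occur in the target set are assumptions; the logarithm argument is read with ordinary operator precedence as (N/ni) + c.

```python
import math
from collections import Counter


def document_frequencies(sentences):
    """n_i for every word: the number of sentences that contain it."""
    df = Counter()
    for words in sentences:
        df.update(set(words))  # count each word once per sentence
    return df


def tfidf_vector(words, df, n_total, c=0.01):
    """{word: TF_i * log(N / n_i + c)} for one tokenized sentence.

    Words absent from the target set have no n_i and are skipped
    (an assumption; the source does not cover this case).
    """
    tf = Counter(words)
    return {w: tf[w] * math.log(n_total / df[w] + c) for w in tf if df.get(w)}
```

The counts from `document_frequencies` correspond to the appearance frequency data that the device stores in advance, so only `tfidf_vector` would run per query.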

As described in claim 12, the semantic analysis unit can be configured to include at least one of: a semantic case analysis unit that extracts each semantic case in the text sentence and assigns different weights to the semantic cases based on the importance of each extracted semantic case; and a central word analysis unit that extracts the words playing a central role in the text sentence and assigns different weights to the central words based on the importance of each extracted central word.

In this case, as described in claim 13, the feature quantity generation unit can correct the feature quantity obtained using the vector space method by the following expression, using the semantic case weight obtained by the semantic case analysis unit and the central word weight obtained by the central word analysis unit.

(TFi + Chunk_Weight + Head_Weight) × log(N/ni + c)

Here, TFi is the number of times word i appears in the text sentence, Chunk_Weight is the weight of the semantic case to which word i belongs, Head_Weight is the central word weight (used when word i is a central word), N is the total number of text sentences in the search target text sentence set, ni is the number of text sentences in the search target text sentence set that contain word i, and c is a constant set so that c ≥ 0.01.

As described in claim 14, the feature quantity generation unit may correct the feature quantity obtained using the vector space method by the following expression, using the semantic case weight obtained by the semantic case analysis unit.

(TFi + Chunk_Weight) × log(N/ni + c)

As described in claim 15, the feature quantity generation unit may correct the feature quantity obtained using the vector space method by the following expression, using the central word weight obtained by the central word analysis unit.

(TFi + Head_Weight) × log(N/ni + c)

As described in claim 16, the feature quantity generation unit may correct the feature quantity obtained using the vector space method by the following expression, using the semantic case weight obtained by the semantic case analysis unit and the central word weight obtained by the central word analysis unit.

(TFi × log(N/ni + c)) × (Chunk_Weight + Head_Weight)/2

As described in claim 17, the feature quantity generation unit may correct the feature quantity obtained using the vector space method by the following expression, using the semantic case weight obtained by the semantic case analysis unit.

TFi × log(N/ni + c) × Chunk_Weight

As described in claim 18, the feature quantity generation unit may correct the feature quantity obtained using the vector space method by the following expression, using the central word weight obtained by the central word analysis unit.

TFi × log(N/ni + c) × Head_Weight

According to the present invention, text sentences can be searched with high accuracy in consideration of their semantic content.

Embodiments of the present invention are described below.

FIG. 1 is a block diagram showing a schematic configuration of a text sentence search device 10 according to the present invention that takes semantic information into consideration. As shown in FIG. 1, the text sentence search device 10 includes an external storage device 12, a word segmentation/part-of-speech assignment unit 14, a syntax/semantic analysis unit 16, a feature quantity generation unit 18, a statistical data storage unit 20, a similarity calculation unit 22, a database storage unit 24, a storage unit 26, and memories 28, 30, 32, and 34.

The external storage device 12 stores, for example, question text sentence data input by a user. The memory 28 holds one item of the question text sentence data stored in the external storage device 12.

The word segmentation/part-of-speech assignment unit 14 segments the question text sentence data stored in the memory 28 into words, assigns a part of speech to each segmented word, and stores the result in the memory 30 as word segmentation/part-of-speech assignment data.

The syntax/semantic analysis unit 16 reads the word segmentation/part-of-speech assignment data of the question text sentence stored in the memory 30, performs syntax analysis and semantic analysis of the question text sentence based on that data, and stores the resulting syntax/semantic analysis data in the memory 32.

The statistical data storage unit 20 stores, as appearance frequency data, a DF (Document Frequency) value for every word that appears in a set of search target text sentences prepared in advance (the search target text sentence set). The DF value of a word is the number of text sentences in the set that contain the word; that is, it represents how frequently the word appears in the text sentence set. The statistical data storage unit 20 also stores in advance natural-number ID data for the collectable word set. Each word is assigned a unique natural-number ID (word ID), and the statistical data storage unit 20 stores the DF value of each word to which a word ID has been assigned. It is preferable to assign the same ID to synonymous words. Words that are written differently but have the same meaning are then treated as equivalent, which improves the search accuracy for text sentences.
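The DF statistics described above can be illustrated with a minimal sketch. The function name, the toy vocabulary, and the choice of a dictionary for the word ID table are assumptions made for illustration; the description does not specify how the statistical data storage unit 20 is implemented.

```python
from collections import Counter


def document_frequencies(sentences, word_ids):
    """Return {word_id: DF}, where DF counts the sentences containing that ID."""
    df = Counter()
    for words in sentences:
        # A set ensures each sentence contributes at most 1 per word ID,
        # even when a word (or one of its synonyms) repeats in the sentence.
        df.update({word_ids[w] for w in words})
    return dict(df)


# Hypothetical toy data: "car" and "automobile" share one ID as synonyms.
word_ids = {"car": 1, "automobile": 1, "red": 2, "fast": 3}
sentences = [["red", "car"], ["fast", "automobile", "car"], ["red", "red"]]
df = document_frequencies(sentences, word_ids)
```

Because synonyms share an ID, the second sentence counts only once toward the DF of ID 1 even though it contains both "automobile" and "car".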

The feature quantity generation unit 18 reads the word segmentation/part-of-speech assignment data of the question text sentence stored in the memory 30 and, from the word DF values stored in the statistical data storage unit 20, reads the DF values of the words of that question text sentence. Using the word segmentation/part-of-speech assignment data, the DF values of the words of the question text sentence, and the syntax/semantic analysis data stored in the memory 32, it generates a semantic vector feature quantity of the question text sentence and stores that data in the memory 34.

The database storage unit 24 stores semantic vector feature quantities calculated in advance for all text sentences in the set of text sentences to be searched.

The similarity calculation unit 22 calculates the similarity (distance) between the semantic vector feature quantity of the question text sentence stored in the memory 34 and the semantic vector feature quantity of each search target text sentence stored in the database storage unit 24, and stores the calculated similarities in the storage unit 26. This makes it possible to display the search target text sentence most similar to the question text sentence, or to display a list of search target text sentences in descending order of similarity.
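As a small illustration of how the stored similarities could be turned into a result list, the following sketch sorts them in descending order. The sentence identifiers and scores are hypothetical, and the description does not state how the storage unit 26 organizes its contents.

```python
def rank_by_similarity(similarities):
    """Sort (sentence_id, similarity) pairs from most to least similar."""
    return sorted(similarities.items(), key=lambda item: item[1], reverse=True)


# Hypothetical similarities between one question and three stored sentences.
scores = {"s1": 0.12, "s2": 0.87, "s3": 0.45}
ranking = rank_by_similarity(scores)
best_id, best_score = ranking[0]
```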

The text sentence search device 10 configured as described above can be applied, for example, to an information terminal device 40 as shown in FIG. 2.

The information terminal device 40 includes a hard disk 42, a keyboard 44, a display 46, and a processor unit 48.

The hard disk 42 stores the data of the question text sentence input from the keyboard 44, various calculation results such as the similarity between the question text sentence and the search target text sentences calculated by the processor unit 48, various software, and the like. The hard disk 42 is also used as the working storage needed for the calculations. Another external storage device may be used instead of the hard disk 42.

The keyboard 44 is an input device with which the user inputs text sentences and instructs various operations. Other input devices such as a mouse may also be provided.

The display 46 is an output device for displaying messages to the user, text sentence data, similarity calculation results, and the like. Other output devices such as a printer may also be provided.

The processor unit 48 performs the actual processing in accordance with the software and the like stored on the hard disk 42. Specifically, the processor unit 48 can be implemented by a microprocessor or by a computer system such as a personal computer.

The word segmentation/part-of-speech assignment unit 14, the syntax/semantic analysis unit 16, the feature quantity generation unit 18, and the similarity calculation unit 22 described above can be implemented as modules running on the processor unit 48.

Next, the specific operation of the text sentence search device 10 is described.

The external storage device 12 stores the data of question text sentences input by the user. One item of text sentence data is read from the external storage device 12 and stored in the memory 28.

The word segmentation/part-of-speech assignment unit 14 segments the text sentence into its constituent words, assigns a part of speech to each word, and stores the result in the memory 30 as word segmentation/part-of-speech assignment data. Publicly known methods can be used for segmenting words and assigning parts of speech. For example, a word segmentation/part-of-speech assignment tool developed by Tsinghua University in China may be used, or another analysis tool may be used.

As an example, when the Chinese text sentence 50 shown in FIG. 3 is input to the word segmentation/part-of-speech assignment unit 14, word segmentation/part-of-speech assignment data 52 is output. The Chinese text sentence 50 means: "In the women's 400-meter final held on the afternoon of Monday, September 25, 2000, Beijing time, the Australian star Ferriman won the gold medal."

The word segmentation/part-of-speech assignment data 52 is data in which a part-of-speech symbol representing a part of speech such as noun, verb, or particle is appended to the end of each word segmented from the text sentence 50, in the form "/○" (where ○ stands for letters denoting a part of speech). For example, the part-of-speech symbol "/nr" denotes a noun. The word segmentation/part-of-speech assignment data 52 is not limited to the format shown in FIG. 3 and may be in any other format from which the part of speech of each segmented word can be determined.
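A minimal sketch of reading data in the "/○" format follows. It assumes the tagged words are separated by whitespace, which the description does not state, and the sample string is invented for illustration (only the "/nr" symbol comes from the text; "v" and "n" are hypothetical tags).

```python
def parse_tagged(text):
    """Split 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the last '/', so a word that itself
        # contains '/' keeps everything before the final separator.
        word, sep, tag = token.rpartition("/")
        if not sep:
            raise ValueError(f"token has no part-of-speech tag: {token!r}")
        pairs.append((word, tag))
    return pairs


# Invented sample; the tags "v" and "n" are placeholders.
tagged = parse_tagged("Ferriman/nr won/v gold-medal/n")
```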

The syntax/semantic analysis unit 16 includes at least one of a semantic case analysis unit and a central word analysis unit. It reads the word segmentation/part-of-speech assignment data of the text sentence stored in the memory 30, performs syntax and semantic analysis of the question text sentence, and sets the semantic importance of each word. Specifically, it analyzes the semantic chunks, central words, and cases (semantic cases) that make up the question text sentence, and stores the analysis result in the memory 32 as syntax/semantic analysis data. Publicly known methods can be used for the syntax analysis and semantic analysis. For example, a syntax/semantic analysis tool developed by Tsinghua University in China may be used, or another analysis tool may be used.

A text sentence generally consists of multiple words, among which one or more words play an important role. Here, a word that plays an important role in a text sentence is called a central word (Head).

As an example, when the Chinese text sentence 54 shown in FIG. 4 is input to the syntax/semantic analysis unit 16, syntax/semantic analysis data 56 is output. The Chinese text sentence 54 has the following meaning.

The syntax/semantic analysis data 56 is in a format in which each semantic chunk analyzed from the text sentence 54 is delimited by "[ ]" and the central words are shown in Gothic font. The letter at the head inside the brackets represents syntax information (the chunk name), and the letter after the brackets represents the case of that semantic chunk.

The letters representing syntax information (chunk names) are defined as follows.

S (Subject) ... subject chunk
P (Predicate) ... predicate chunk
O (Object) ... patient (object) chunk
D ... adverb chunk
C (Complement) ... complement chunk
V ... predicate or predicative chunk

Regarding the cases of semantic chunks, when a single sentence contains two or more S (agent), O (object of the predicate), or V (action) chunks, they are written S1, S2, ..., O1, O2, ..., V1, V2, ... according to their order of appearance from left to right in the sentence.

The syntax/semantic analysis data 56 is not limited to the format shown in FIG. 4 and may be in any other format from which the semantic chunks, central words, and cases can be determined.

The feature quantity generation unit 18 generates the semantic vector feature quantity of the question text sentence using the word segmentation/part-of-speech assignment data, the DF values of the words of the question text sentence, and the syntax/semantic analysis data, and stores that data in the memory 34.

The semantic vector feature quantity is described next; before that, the conventional vector feature quantity is described.

A text sentence can be expressed by a vector V(f1, f2, f3, ..., fs). Here, s is the total number of words. fi is the weight (feature quantity) of the word that appears in the text sentence and has the i-th largest word ID among the words contained in the text sentence (hereinafter, word i), and is given by the following equation.

fi = TFi × log(N / ni)   ... (1)

TFi (the first parameter) is the frequency of word i in the text sentence, that is, the number of occurrences of word i in the text sentence. TFi therefore tends to increase as the text sentence becomes longer. log(N / ni) is the so-called IDF (the second parameter), where N is the total number of text sentences in the search target text sentence set stored in the database storage unit 24 and ni is the DF value of word i. The IDF becomes smaller as more text sentences contain word i and larger as fewer text sentences contain it. The IDF thus represents the importance of word i for the text sentence in question.
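Equation (1) can be sketched as follows. The function name and the toy DF table are illustrative assumptions, and the natural logarithm is used because the description does not specify the base of the logarithm.

```python
import math
from collections import Counter


def tfidf_weights(word_ids_in_sentence, df, n_sentences):
    """Return {word_id: TF * log(N / n_i)} for one sentence, per equation (1)."""
    tf = Counter(word_ids_in_sentence)  # TFi: occurrences of each word ID
    return {wid: count * math.log(n_sentences / df[wid])
            for wid, count in tf.items()}


# Hypothetical DF table for a collection of N = 100 sentences.
df = {1: 2, 2: 10, 3: 100}
weights = tfidf_weights([1, 1, 3], df, n_sentences=100)
```

In this toy example the word with ID 3 occurs in every sentence, so its IDF is log(1) = 0 and its weight vanishes regardless of TFi.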

The number of dimensions s of V(f1, f2, f3, ..., fs) is the total number of words that appear in the text sentence. The vector feature quantities of different text sentences may differ in their number of dimensions, and the word corresponding to each dimension may also differ.

As described above, each word i is assigned a natural-number ID. To calculate the similarity between text sentences from their vector feature quantities, this ID is paired with the weight fi, and the vector feature quantity is expressed as V(ID1, f1, ID2, f2, ID3, f3, ..., IDs, fs). ID1, ID2, ... are arranged in ascending order of ID number.
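The ID-ordered representation can be sketched as a list of (ID, weight) pairs sorted by ID. Using pairs rather than the flat sequence ID1, f1, ID2, f2, ... is a choice made here for readability, not something the description prescribes.

```python
def to_sorted_vector(weights):
    """Turn {word_id: weight} into [(ID1, f1), (ID2, f2), ...] sorted by ID."""
    # Tuples sort by their first element, and word IDs are unique,
    # so a plain sort yields ascending ID order.
    return sorted(weights.items())


vector = to_sorted_vector({42: 0.7, 3: 1.9, 17: 0.2})
```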

The feature quantity generation unit 18 reads, from the statistical data storage unit 20, the DF value of each word of the text sentence recorded in the memory 28 and, using the syntax/semantic analysis data stored in the memory 32, creates the semantic vector feature quantity Sv(ID1, f1, ID2, f2, ID3, f3, ..., IDs, fs) of the text sentence.

Next, the cases that express semantic relationships are described.

FIG. 5 shows a classification table 58 that lists the cases expressing semantic relationships, classified by importance. In the example of FIG. 5, the cases are classified into a first class through a fifth class, and the cases in the first class have the highest importance. The first class includes the semantic cases of highly important proper nouns, such as the agent of an action.

For example, the case information of the first semantic chunk of the syntax/semantic analysis data 56 shown in FIG. 4 is "S", so the case class of this semantic chunk is the first class. Similarly, the case information of the second semantic chunk is "H", so its case class is also the first class. In this way, the case of each semantic chunk is classified according to its importance.

Next, the method of generating the semantic vector feature quantity Sv(ID1, f1, ID2, f2, ID3, f3, ..., IDs, fs) is described.

Raising the weight of the central words in the vector feature quantity of the question text sentence makes the vector feature quantity reflect the meaning of the question text sentence more readily. The feature quantity generation unit 18 therefore generates the semantic vector feature quantity of the question text sentence using the frequency of the words appearing in the question text sentence, the semantic chunks, and the central words. Three methods of generating the semantic vector feature quantity are described below.

First, the first semantic vector feature quantity generation method is described. In this method, the semantic chunk weight Chunk_Weight is first set according to the case class as shown in FIG. 6. Here, a is a constant (a ≥ 0). That is, the higher the case class, the larger the semantic chunk weight Chunk_Weight. The semantic chunk weight may also be defined in other ways.

The central word weight Head_Weight is likewise set according to the case class as shown in FIG. 7. Here, b is a constant (b ≥ 0). That is, the higher the case class, the larger the central word weight Head_Weight. The central word weight may also be defined in other ways.

Then the weight fi of each word i constituting the semantic vector feature quantity Sv(ID1, f1, ID2, f2, ID3, f3, ..., IDs, fs) of the text sentence is obtained by the following equation.

fi = (TFi + Chunk_Weight + Head_Weight) × log(N / ni + c)   ... (2)

Here, N is, as described above, the total number of text sentences in the text sentence set. TFi is the frequency of word i in the text sentence, and ni is the number of text sentences in the search target text sentence set that contain word i. c is a constant and is set to c ≥ 0.01. The central word weight Head_Weight is used only when word i is a central word; when word i is not a central word, Head_Weight = 0.

Equation (2) of the first semantic vector feature quantity generation method differs from the conventional equation (1) in that the sum of the semantic chunk weight and the central word weight is added to TFi. As a result, the value of the weight fi changes according to the case class, and the generated semantic vector feature quantity reflects the meaning of the text sentence more appropriately than the conventional one. Compared with the second and third methods described below, the first method is less affected by the semantic chunk weight and the central word weight; it is therefore better suited to single-sentence search and can be regarded as a method suited to machine translation and the like.
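Equation (2) can be sketched as below. Several points are assumptions: the expression log(N / ni + c) is read as log((N / ni) + c), the natural logarithm is used, and the weight values passed in are placeholders, since the tables of FIG. 6 and FIG. 7 are not reproduced in this text.

```python
import math


def weight_method1(tf, n_total, n_i, chunk_weight, head_weight=0.0, c=0.01):
    """Equation (2): (TF + Chunk_Weight + Head_Weight) * log(N / n_i + c).

    head_weight stays at 0 unless the word is a central word.
    """
    return (tf + chunk_weight + head_weight) * math.log(n_total / n_i + c)


# Placeholder weights: 0.5 stands in for values that FIG. 6 / FIG. 7 would give.
central = weight_method1(tf=1, n_total=1000, n_i=10,
                         chunk_weight=0.5, head_weight=0.5)
ordinary = weight_method1(tf=1, n_total=1000, n_i=10, chunk_weight=0.5)
```

With identical TFi and DF, the central word receives the larger weight because Head_Weight is added only for central words.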

Next, the second semantic vector feature quantity generation method is described. In this method, as in the first method, the semantic chunk weight Chunk_Weight is set according to the case class as shown in FIG. 6, but the constant a is set to a ≥ 1.

The central word weight Head_Weight is also set according to the case class as shown in FIG. 7, as in the first method, but the constant b is set to b ≥ 1. The weight fi of word i is then obtained by the following equation.

fi = (TFi × log(N / ni + c)) × (Chunk_Weight + Head_Weight) / 2   ... (3)

Here, the constant c is set to c ≥ 0.01, as in the first method. When word i is not a central word, Head_Weight = Chunk_Weight.

Equation (3) of the second semantic vector feature quantity generation method differs from the conventional equation (1) in that the fi obtained by equation (1) is multiplied by the sum of the semantic chunk weight and the central word weight divided by 2, that is, by the average of the two weights. In addition, because the constants a and b are both set to 1 or more in the second method, fi tends to be larger than the fi obtained by the conventional equation (1), by an amount that depends on the case class. The importance of the semantic content is therefore reflected more strongly, and the generated semantic vector feature quantity reflects the meaning of the text sentence more appropriately than the conventional one.
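Equation (3) can be sketched in the same way. As before, reading log(N / ni + c) as log((N / ni) + c), using the natural logarithm, and the concrete weight values are assumptions made for illustration.

```python
import math


def weight_method2(tf, n_total, n_i, chunk_weight, head_weight=None, c=0.01):
    """Equation (3): TF * log(N / n_i + c) * (Chunk_Weight + Head_Weight) / 2."""
    if head_weight is None:
        # Not a central word: the text sets Head_Weight = Chunk_Weight,
        # so the average reduces to Chunk_Weight.
        head_weight = chunk_weight
    return tf * math.log(n_total / n_i + c) * (chunk_weight + head_weight) / 2


base = math.log(1000 / 10 + 0.01)  # TF = 1 times the smoothed IDF term
central = weight_method2(1, 1000, 10, chunk_weight=2.0, head_weight=3.0)
ordinary = weight_method2(1, 1000, 10, chunk_weight=2.0)
```

Here the central word is scaled by the average (2.0 + 3.0) / 2 = 2.5, while the ordinary word is scaled by Chunk_Weight alone.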

Next, the third semantic vector feature quantity generation method is described. In this method, as in the first method, the semantic chunk weight Chunk_Weight is set according to the case class as shown in FIG. 6, but the constant a is set to a ≥ 1. The weight fi of word i is then obtained by the following equation.

fi = (TFi × log(N / ni + c)) × Chunk_Weight × Head_Weight   ... (4)

Here, the constant c is set to c ≥ 0.01, as in the first method. When word i is not a central word, Head_Weight = 1.
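Equation (4) can be sketched as follows, under the same assumptions as for the other methods: log(N / ni + c) is read as log((N / ni) + c), the natural logarithm is used, and the weight values are placeholders.

```python
import math


def weight_method3(tf, n_total, n_i, chunk_weight, head_weight=1.0, c=0.01):
    """Equation (4): TF * log(N / n_i + c) * Chunk_Weight * Head_Weight.

    head_weight stays at 1 unless the word is a central word.
    """
    return tf * math.log(n_total / n_i + c) * chunk_weight * head_weight


base = math.log(1000 / 10 + 0.01)  # TF = 1 times the smoothed IDF term
central = weight_method3(1, 1000, 10, chunk_weight=2.0, head_weight=3.0)
ordinary = weight_method3(1, 1000, 10, chunk_weight=2.0)
```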

Equation (4) of the third semantic vector feature quantity generation method differs from the conventional equation (1) in that the fi obtained by equation (1) is multiplied by the product of the semantic chunk weight and the central word weight. In addition, because the constants a and b are both set to 1 or more in the third method, fi tends to be larger than the fi obtained by the conventional equation (1), by an amount that depends on the case class. The importance of the semantic content is therefore reflected more strongly, and the generated semantic vector feature quantity reflects the meaning of the text sentence more appropriately than the conventional one. Comparing the third method with the second method, the fi obtained by equation (4) tends to be larger than the fi obtained by equation (3).
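The comparison between equations (3) and (4) can be checked directly. The short derivation below is an added illustration rather than part of the original description. It writes x for Chunk_Weight and y for Head_Weight and assumes x ≥ 1 and y ≥ 1, as would hold if the weights of FIG. 6 and FIG. 7 are at least 1 when a ≥ 1 and b ≥ 1.

```latex
\[
xy - \frac{x + y}{2}
  = \frac{x(y - 1) + y(x - 1)}{2} \;\ge\; 0
  \qquad \text{for } x \ge 1,\; y \ge 1 .
\]
```

Under these assumptions the multiplier in equation (4) is never smaller than the multiplier in equation (3) for a central word. For a word that is not a central word, both multipliers reduce to Chunk_Weight, because Head_Weight = Chunk_Weight in equation (3) and Head_Weight = 1 in equation (4).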

As described above, the database storage unit 24 stores in advance the semantic vector feature quantity of each text sentence in the set of text sentences to be searched. These semantic vector feature quantities can be generated with the generation methods described above, in the same way as the semantic vector feature quantity of the question text sentence.

Next, the specific method of calculating the similarity in the similarity calculation unit 22 is described.

The similarity calculation unit 22 calculates the similarity between text sentences using any one of equations (5) to (7) shown below.

Here, s1 and s2 denote the number of dimensions of the semantic vector feature quantities of text sentence S1 and text sentence S2, respectively, and f1_k and f2_h denote the weights of each dimension of the semantic vector feature quantities of text sentence S1 and text sentence S2, respectively. ID1_k and ID2_h denote the word IDs of each dimension of the semantic vector feature quantities of text sentence S1 and text sentence S2, respectively.

First, the first similarity calculation, which uses equation (5), is described in detail with reference to the flowchart shown in FIG. 8.

In step 100, the index k for text sentence S1, the index h for text sentence S2, and the similarity Sim(S1, S2) are initialized. That is, k = 1, h = 1, and Sim(S1, S2) = 0.

In step 102, it is determined whether the word IDs of text sentence S1 and text sentence S2 are the same, that is, whether ID1_k = ID2_h. If the word IDs are the same, the process proceeds to step 104; if they are not, the process proceeds to step 108.

In step 104, k and h are each incremented, and the product of the weights f1_k and f2_h is added to the current similarity Sim(S1, S2) to give the new similarity Sim(S1, S2). That is, k = k + 1, h = h + 1, and Sim(S1, S2) = Sim(S1, S2) + f1_k × f2_h.

In step 106, it is determined whether k is greater than the number of dimensions s1 and whether h is greater than the number of dimensions s2. If k is greater than s1 or h is greater than s2, the routine ends; otherwise, the process returns to step 102 and the same processing is repeated.

In step 108, it is determined whether the word ID ID1_k of text sentence S1 is smaller than the word ID ID2_h of text sentence S2, that is, whether ID1_k < ID2_h. If ID1_k < ID2_h, the process proceeds to step 110; otherwise, that is, if ID1_k > ID2_h, the process proceeds to step 114.

In step 110, k is incremented and the process proceeds to step 112. In step 112, it is determined whether k is greater than the number of dimensions s1. If k is greater than s1, the routine ends; if k is s1 or less, the process returns to step 102 and the same processing is repeated.

In step 114, h is incremented and the process proceeds to step 116. In step 116, it is determined whether h is greater than the number of dimensions s2. If h is greater than s2, the routine ends; if h is s2 or less, the process returns to step 102 and the same processing is repeated.

In this way, the routine compares the word IDs of the words in text sentence S1 with the word IDs of the words in text sentence S2 to find the words contained in both text sentences, multiplies the weights of each such word, and adds up the products successively. As described above, ID1, ID2, ... of the semantic vector feature quantity Sv(ID1, f1, ID2, f2, ID3, f3, ..., IDs, fs) are arranged in ascending order of ID, so the words contained in both text sentences can be found quickly with a simple flow such as that shown in FIG. 8.
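Steps 100 to 116 amount to a merge-style scan over two ID-sorted lists, which can be sketched as follows. Equation (5) is not reproduced in this text, so this sketch follows the flowchart description only. It uses 0-based indices, represents each vector as a list of (ID, weight) pairs, and adds the product before advancing k and h; all three are choices made for this illustration.

```python
def inner_product_similarity(v1, v2):
    """Sum f1_k * f2_h over word IDs present in both ID-sorted vectors.

    v1 and v2 are lists of (word_id, weight) pairs in ascending ID order.
    """
    k, h, sim = 0, 0, 0.0              # step 100 (0-based indices here)
    while k < len(v1) and h < len(v2):  # termination checks of steps 106/112/116
        id1, f1 = v1[k]
        id2, f2 = v2[h]
        if id1 == id2:                 # step 102 -> step 104
            sim += f1 * f2             # add the product, then advance both
            k += 1
            h += 1
        elif id1 < id2:                # step 108 -> step 110
            k += 1
        else:                          # step 114
            h += 1
    return sim


# Hypothetical vectors sharing the word IDs 4 and 9.
s1 = [(1, 2.0), (4, 1.0), (9, 3.0)]
s2 = [(4, 5.0), (7, 1.0), (9, 2.0)]
sim = inner_product_similarity(s1, s2)
```

Because both lists are sorted by ID, each element is visited at most once, so the scan takes time proportional to s1 + s2 rather than s1 × s2.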

Next, the second similarity calculation, which uses equation (6), is described in detail with reference to the flowchart shown in FIG. 9.

In step 200, the denominator of equation (6) is calculated. That is, the square root of the product of the sum of the squares of the weights f1_k and the sum of the squares of the weights f2_h is obtained.

In step 202, the index k for text sentence S1, the index h for text sentence S2, and the similarity Sim(S1, S2) are initialized. That is, k = 1, h = 1, and Sim(S1, S2) = 0.

In step 204, it is determined whether the word IDs of text sentence S1 and text sentence S2 are the same, that is, whether ID1_k = ID2_h. If the word IDs are the same, the process proceeds to step 206; if they are not, the process proceeds to step 212.

In step 206, k and h are each incremented, and the product of the weights f1_k and f2_h is added to the current similarity Sim(S1, S2) to give the new similarity Sim(S1, S2). That is, k = k + 1, h = h + 1, and Sim(S1, S2) = Sim(S1, S2) + f1_k × f2_h.

In step 208, it is determined whether k is greater than the number of dimensions s1 and whether h is greater than the number of dimensions s2. If k is greater than s1 or h is greater than s2, the process proceeds to step 210; otherwise, the process returns to step 204 and the same processing is repeated.

In step 212, it is determined whether the word ID ID1_k of text sentence S1 is smaller than the word ID ID2_h of text sentence S2, that is, whether ID1_k < ID2_h holds. If ID1_k < ID2_h, the process proceeds to step 214; otherwise, that is, if ID1_k > ID2_h, the process proceeds to step 218.

In step 214, k is incremented and the process proceeds to step 216. In step 216, it is determined whether k is greater than the number of dimensions s1. If k is greater than s1, the process proceeds to step 210; if k is less than or equal to s1, the process returns to step 204 and the same processing is repeated.

In step 218, on the other hand, h is incremented and the process proceeds to step 220. In step 220, it is determined whether h is greater than the number of dimensions s2. If h is greater than s2, the process proceeds to step 210; if h is less than or equal to s2, the process returns to step 204 and the same processing is repeated.

In step 210, the current similarity Sim(S1, S2) is divided by the W obtained in step 200, and the result is taken as the final similarity Sim(S1, S2). That is, Sim(S1, S2) = Sim(S1, S2) / W.

Thus, the similarity obtained by this routine differs from that obtained by the routine shown in FIG. 8 only in that it is divided by W. For the similarity Sim(S1, S2) obtained by equations (5) and (6) above, a higher value indicates a higher degree of similarity.
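The second similarity routine (steps 200 to 220) can be sketched in the same way. This is an illustrative sketch under stated assumptions: the name `cosine_similarity`, the `(word_id, weight)` list representation, and the handling of an all-zero vector (returning 0.0 instead of dividing by zero) are not taken from the description.

```python
import math


def cosine_similarity(s1, s2):
    """Merge-based inner product divided by W, as in steps 200 to 210.

    s1, s2: lists of (word_id, weight) pairs sorted by ascending word_id
    (hypothetical representation of the semantic vector feature quantity).
    """
    # Step 200: W = sqrt(sum of squared f1 weights * sum of squared f2 weights).
    w = math.sqrt(sum(f * f for _, f in s1) * sum(f * f for _, f in s2))
    if w == 0.0:
        return 0.0  # assumption: the source does not cover empty or zero vectors

    # Steps 202 to 220: accumulate products for word IDs found in both vectors.
    k, h, sim = 0, 0, 0.0
    while k < len(s1) and h < len(s2):
        id1, f1 = s1[k]
        id2, f2 = s2[h]
        if id1 == id2:
            sim += f1 * f2
            k += 1
            h += 1
        elif id1 < id2:
            k += 1
        else:
            h += 1

    # Step 210: normalize by W.
    return sim / w


if __name__ == "__main__":
    a = [(1, 3.0), (2, 4.0)]
    b = [(2, 4.0), (3, 3.0)]
    print(cosine_similarity(a, b))
```

Dividing by W bounds the result for non-negative weights, so sentences of different lengths can be compared on the same scale.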

Next, the third similarity calculation, based on equation (7) above, is described in detail with reference to the flowchart shown in FIG. 10.

In step 300, the subscript k for text sentence S1, the subscript h for text sentence S2, and the variable W used for the Euclidean distance calculation are initialized. That is, k = 1, h = 1, and W = 0.

In step 302, it is determined whether the word IDs of text sentence S1 and text sentence S2 are the same, that is, whether ID1_k = ID2_h. If the word IDs are the same, the process proceeds to step 304; if they are not, the process proceeds to step 310.

In step 304, k and h are each incremented, and the square of the difference between the weights f1_k and f2_h is added to the current variable W to give the new variable W. That is, k = k + 1, h = h + 1, and W = W + (f1_k − f2_h)².

In step 306, it is determined whether k is greater than the number of dimensions s1 and whether h is greater than the number of dimensions s2. If k is greater than s1 or h is greater than s2, the process proceeds to step 308; otherwise, the process returns to step 302 and the same processing is repeated.

In step 310, it is determined whether the word ID ID1_k of text sentence S1 is smaller than the word ID ID2_h of text sentence S2, that is, whether ID1_k < ID2_h holds. If ID1_k < ID2_h, the process proceeds to step 312; otherwise, that is, if ID1_k > ID2_h, the process proceeds to step 316.

In step 312, k is incremented, and the square of the weight f1_k is added to the current variable W to give the new variable W. That is, k = k + 1 and W = W + (f1_k)².

In step 314, it is determined whether k is greater than the number of dimensions s1. If k is greater than s1, the process proceeds to step 319; if k is less than or equal to s1, the process returns to step 302 and the same processing is repeated.

In step 319, the variable W is calculated by the following equation, and the process proceeds to step 308.

In step 316, on the other hand, h is incremented, and the square of the weight f2_h is added to the current variable W to give the new variable W. That is, h = h + 1 and W = W + (f2_h)².

In step 318, it is determined whether h is greater than the number of dimensions s2. If h is greater than s2, the process proceeds to step 320; if h is less than or equal to s2, the process returns to step 302 and the same processing is repeated.

In step 320, the variable W is calculated by the following equation, and the process proceeds to step 308.

In step 308, the square root of the current W is calculated, and the result is taken as the final similarity measure Dist(S1, S2).

Thus, this routine calculates the Euclidean distance between the weights f1_k and f2_h as the similarity measure. That is, the higher the value of Dist(S1, S2) (the greater the distance), the lower the similarity. This third calculation method is considered effective when the semantic vector feature quantity is generated by equations (3) and (4) above. As described earlier, equations (3) and (4) reflect the importance of the semantic content more strongly than equation (2); this carries over into the Euclidean distance calculation, so that differences in semantic content are strongly reflected in differences in similarity.
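The third routine can be sketched as a Euclidean distance over two sparse, ID-sorted vectors. The equations used in steps 319 and 320 are not reproduced in this text, so the sketch assumes that they add the squared weights of the words still unvisited in the other sentence once one sentence is exhausted; it applies the same remainder handling after a match as well. The function name `euclidean_distance` and the `(word_id, weight)` list representation are likewise assumptions.

```python
import math


def euclidean_distance(s1, s2):
    """Euclidean distance between two sparse vectors sorted by word_id.

    s1, s2: lists of (word_id, weight) pairs in ascending word_id order
    (hypothetical representation). A word missing from one vector is
    treated as having weight 0 there.
    """
    k, h, w = 0, 0, 0.0
    while k < len(s1) and h < len(s2):
        id1, f1 = s1[k]
        id2, f2 = s2[h]
        if id1 == id2:
            w += (f1 - f2) ** 2   # step 304: word in both sentences
            k += 1
            h += 1
        elif id1 < id2:
            w += f1 ** 2          # step 312: word only in S1
            k += 1
        else:
            w += f2 ** 2          # step 316: word only in S2
            h += 1
    # Assumed content of steps 319/320: add squared weights of leftover words.
    w += sum(f * f for _, f in s1[k:])
    w += sum(f * f for _, f in s2[h:])
    return math.sqrt(w)           # step 308


if __name__ == "__main__":
    a = [(1, 3.0), (2, 1.0)]
    b = [(2, 1.0), (3, 4.0)]
    print(euclidean_distance(a, b))
```

Unlike the first two measures, a smaller return value here means the two sentences are more similar.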

Next, the result of using the text sentence search device 10 to obtain the similarity between text sentence A and text sentence B shown below is described.

Text sentence A and text sentence B above both center on the content "In the women's 400-meter final, the Australian star athlete Ferriman won the gold medal," and can therefore be regarded as sentences with similar semantic content.

First, analyzing text sentence A and text sentence B with the word segmentation / part-of-speech assignment unit 14 and the syntax / semantic analysis unit 16 gave the following results.

The feature quantity generation unit 18 then obtained the semantic vector feature quantity Av(IDa1, fa1, IDa2, fa2, ..., IDas, fas) of text sentence A using equation (5) above; the word ai of each dimension and its weight fai are shown in FIG. 11.

Similarly, the feature quantity generation unit 18 obtained the semantic vector feature quantity Bv(IDb1, fb1, IDb2, fb2, ..., IDbs, fbs) of text sentence B using equation (5) above; the word bi of each dimension and its weight fbi are shown in FIG. 12. The constants a, b, and c in equation (5) were set to a = 0, b = 0, and c = 0.

The similarity calculation unit 22 then calculated the similarity Sim(Av, Bv) from the semantic vector feature quantities of text sentence A and text sentence B using equation (3) above, giving Sim(Av, Bv) = 0.769. By contrast, the similarity obtained with the conventional vector space method was Sim(Av, Bv) = 0.631.

Thus, the similarity calculated by the method according to the present invention was larger than the similarity calculated by the conventional vector space method. This shows that the method according to the present invention can appropriately calculate the similarity between text sentence A and text sentence B, whose semantic content is similar, and it can be said to be more accurate than the conventional vector space method.

Although this embodiment has been described for the case where the present invention is applied to calculating the similarity of Chinese text sentences, the present invention is also applicable to other languages, such as Japanese and English.

FIG. 1 is a block diagram showing the schematic configuration of the text sentence search device.
FIG. 2 is a block diagram showing an example of a device to which the text sentence search device is applied.
FIG. 3 is a diagram showing the result of analyzing a text sentence by the word segmentation / part-of-speech assignment unit.
FIG. 4 is a diagram showing the result of analyzing a text sentence by the syntax / semantic analysis unit.
FIG. 5 is a diagram showing the classification of case classes.
FIG. 6 is a diagram showing the relationship between case classes and semantic chunk weights.
FIG. 7 is a diagram showing the relationship between case classes and central word weights.
FIG. 8 is a flowchart showing the flow of the first similarity calculation.
FIG. 9 is a flowchart showing the flow of the second similarity calculation.
FIG. 10 is a flowchart showing the flow of the third similarity calculation.
FIG. 11 is a diagram showing the word and weight of each dimension of the semantic vector feature quantity of text sentence A.
FIG. 12 is a diagram showing the word and weight of each dimension of the semantic vector feature quantity of text sentence B.

Explanation of symbols

10 Text sentence search device
12 External storage device
14 Word segmentation / part-of-speech assignment unit
16 Syntax / semantic analysis unit
18 Feature quantity generation unit
20 Statistical data storage unit
22 Similarity calculation unit
24 Database storage unit
26 Storage unit
28, 30, 32, 34 Memory

Claims

The question text sentence is segmented into words, part of speech is given to the segmented words, word segmentation / part of speech provision means for generating word segmentation / part of speech provision data,
A syntax / semantic analysis means for analyzing a semantic chunk included in the question text sentence, a central word of the semantic chunk, and a case of the semantic chunk, based on the word segmentation / part of speech assignment data,
Appearance frequency data storage means in which appearance frequency data representing the frequency of appearance in the search target text sentence set for each word appearing in the search target text sentence set is stored;
Setting means for setting at least one of a semantic chunk weight and a central word weight of a word included in the semantic chunk based on the case of the semantic chunk;
By calculating the weight of each word included in the question text sentence based on the appearance frequency data for the word included in the question text sentence and at least one of the semantic chunk weight and the central word weight, respectively. Feature quantity generating means for generating a semantic vector feature quantity of the question text sentence;
Semantic vector feature value storage means for storing in advance the semantic vector feature value of each text sentence in the set of text sentences to be searched;
Similarity calculating means for calculating the similarity between the question text sentence and the text sentence to be searched based on the meaning vector feature quantity of the question text sentence and the meaning vector feature quantity of the text sentence to be searched;
A text sentence search device characterized by comprising:

The text sentence search device according to claim 1, wherein the feature quantity generation means calculates the weight of a word included in the question text sentence by multiplying an addition value, obtained by adding a first parameter based on the frequency with which the word appears in the question text sentence and at least one of the semantic chunk weight and the central word weight, by a second parameter based on the frequency with which the word appears in the search target text sentence set.

The text sentence search device according to claim 1, wherein the feature quantity generation means calculates the weight of a word included in the question text sentence by multiplying a multiplication value, obtained by multiplying a first parameter based on the frequency with which the word appears in the question text sentence by a second parameter based on the frequency with which the word appears in the search target text sentence set, by a coefficient including at least one of the semantic chunk weight and the central word weight.

The text sentence search device according to claim 1, wherein the similarity calculation means searches for matching words among the words included in the question text sentence and the words included in the search target text sentence, and calculates the similarity by adding, over all of the matching words, the values obtained by multiplying the weights of the matched words.

The similarity calculation means searches for a matching word among words included in the question text sentence and a word included in the search target text sentence, calculates a distance between weights of the searched words, The text sentence search apparatus according to claim 1, wherein the similarity is calculated by adding the calculated distances to all the matched words.

The question text is segmented into words, part of speech is given to the segmented words, and word segmentation / part of speech data is generated,
Based on the word segmentation / part-of-speech assignment data, the semantic chunk included in the question text sentence, the central word of the semantic chunk, and the case of the semantic chunk are analyzed,
Based on the case of the semantic chunk, at least one of a semantic chunk weight and a central word weight of a word included in the semantic chunk is set,
Of the appearance frequency data representing the frequency of appearance in the search target text sentence set for each word appearing in the search target text sentence set, the appearance frequency data for the words included in the question text sentence, Generating a semantic vector feature amount of the question text sentence by calculating a weight of each word included in the question text sentence based on at least one of the semantic chunk weight and the central word weight;
Calculating a similarity between the question text sentence and the text sentence to be searched based on the semantic vector feature quantity of the question text sentence and the semantic vector feature quantity of the text sentence to be searched;
The text sentence search method characterized by this.

Cutting the question text sentence into words, adding part of speech to the cut word, and generating word separation / part of speech data;
Analyzing the semantic chunk included in the question text sentence, the central word of the semantic chunk, and the case of the semantic chunk based on the word segmentation / part of speech assignment data;
Setting at least one of a semantic chunk weight and a central word weight of a word included in the semantic chunk based on the case of the semantic chunk;
Of the appearance frequency data representing the frequency of appearance in the search target text sentence set for each word appearing in the search target text sentence set, the appearance frequency data for the words included in the question text sentence, Generating a semantic vector feature quantity of the question text sentence by calculating a weight of each word included in the question text sentence based on at least one of a semantic chunk weight and the central word weight;
Calculating the similarity between the question text sentence and the search target text sentence based on the semantic vector feature quantity of the question text sentence and the semantic vector feature quantity of the text sentence to be searched;
A text sentence search program for causing a computer to execute a process including:

A semantic analysis unit that performs language semantic analysis of the question text sentence and sets the semantic importance of each word of the question text sentence;
A feature amount generation unit that generates a feature amount of the question text sentence using an analysis result of the language semantic analysis and a vector space method;
A similarity calculation unit that calculates the similarity between the feature amount of the question text sentence generated by the feature amount generation unit and the feature amount of the search target text sentence;
A search result extraction unit that extracts a text sentence as a search result from a set of search target text sentences based on the calculation result of the similarity;
A text sentence search device comprising:

9. The text sentence retrieval apparatus according to claim 8, wherein the semantic analysis unit performs language semantic analysis based on case grammar.

The text sentence search apparatus according to claim 8 or 9, wherein the similarity calculation unit calculates the similarity using at least one of an inner product, a cosine function, and a Euclidean distance.

11. The text sentence search device according to claim 8, wherein the feature quantity generation unit calculates a feature quantity obtained by using the vector space method by the following equation.
TFi × log (N / ni + c)
Here, TFi is the number of times the word i appears in the text sentence, N is the total number of text sentences in the search target text sentence set, ni is the total number of text sentences including the word i in the search target text sentence set, and c is a constant with c ≧ 0.01.
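As an illustration of the equation above, the following sketch computes the weight of a single word. The function name `tf_idf_weight` and its parameter names are hypothetical, and the natural logarithm is an assumption, since the claim does not state the base of the logarithm.

```python
import math


def tf_idf_weight(tf, n_total, n_with_word, c=0.01):
    """TFi x log(N / ni + c) for one word.

    tf:          number of times the word appears in the text sentence (TFi)
    n_total:     total number of text sentences in the search target set (N)
    n_with_word: number of those sentences that contain the word (ni)
    c:           constant, c >= 0.01 in the claim
    The natural logarithm is assumed here.
    """
    return tf * math.log(n_total / n_with_word + c)


if __name__ == "__main__":
    # A word occurring twice, found in 10 of 1000 sentences.
    print(tf_idf_weight(2, 1000, 10))
    # A word found in every sentence still gets a small positive weight
    # because of the constant c.
    print(tf_idf_weight(2, 1000, 1000))
```

With c > 0, a word that appears in every sentence of the set keeps a small non-zero weight rather than dropping out of the vector entirely.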

The semantic analysis unit extracts each semantic case in the text sentence and, based on the importance of the extracted semantic case, assigns a different weight to the semantic case, and plays a central role in the text sentence. 12. The method according to claim 9, further comprising at least one of a central word analysis unit that extracts a responsible word and assigns a different weight to the central word based on the importance of the extracted central word. The text sentence search device according to any one of claims.

The feature quantity generation unit is obtained using the vector space method according to the following equation, using the semantic case weight obtained by the semantic case analysis unit and the central word weight obtained by the central word analysis unit. 13. The text sentence search apparatus according to claim 12, wherein the feature quantity to be corrected is corrected.
(TFi + Chunk_Weight + Head_Weight) × log (N / ni + c)
Where TFi is the number of times the word i appears in the text sentence, Chunk_Weight is the semantic weight to which the word i belongs, Head_Weight is the central word weight (used when the word i is the central word), and N is the search target text sentence The total number of text sentences in the set, ni is the total number of text sentences including the word i in the search target text sentence set, c is a constant, and c ≧ 0.01.

13. The feature quantity generation unit corrects a feature quantity obtained by using the vector space method according to the following equation, using the semantic case weight obtained by the semantic case analysis unit. Text sentence search device.
(TFi + Chunk_Weight) × log (N / ni + c)
Here, TFi is the number of times the word i appears in the text sentence, Chunk_Weight is the semantic weight to which the word i belongs, N is the total number of text sentences in the search target text sentence set, ni is the total number of text sentences including the word i in the search target text sentence set, and c is a constant with c ≧ 0.01.

The feature quantity generation unit corrects a feature quantity obtained by using the vector space method according to the following equation, using the central word weight obtained by the central word analysis unit. Text sentence search device.
(TFi + Head_Weight) × log (N / ni + c)
Where TFi is the number of times the word i appears in the text sentence, Head_Weight is the central word weight (used when the word i is the central word), N is the total number of text sentences in the search target text sentence set, and ni is the search target text The total number of text sentences including the word i in the sentence set, c is a constant, and c ≧ 0.01.

The feature quantity generation unit is obtained using the vector space method according to the following equation, using the semantic case weight obtained by the semantic case analysis unit and the central word weight obtained by the central word analysis unit. 13. The text sentence search apparatus according to claim 12, wherein the feature quantity to be corrected is corrected.
(TFi × log (N / ni + c)) × (Chunk_Weight + Head_Weight) / 2
Where TFi is the number of times the word i appears in the text sentence, Chunk_Weight is the semantic weight to which the word i belongs, Head_Weight is the central word weight (used when the word i is the central word), and N is the search target text sentence The total number of text sentences in the set, ni is the total number of text sentences including the word i in the search target text sentence set, c is a constant, and c ≧ 0.01.

13. The feature quantity generation unit corrects a feature quantity obtained by using the vector space method according to the following equation, using the semantic case weight obtained by the semantic case analysis unit. Text sentence search device.
TFi × log (N / ni + c) × Chunk_Weight
Here, TFi is the number of times the word i appears in the text sentence, Chunk_Weight is the semantic weight to which the word i belongs, N is the total number of text sentences in the search target text sentence set, ni is the total number of text sentences including the word i in the search target text sentence set, and c is a constant with c ≧ 0.01.

The feature quantity generation unit corrects a feature quantity obtained by using the vector space method according to the following equation, using the central word weight obtained by the central word analysis unit. Text sentence search device.
TFi × log (N / ni + c) × Head_Weight
Where TFi is the number of times the word i appears in the text sentence, Head_Weight is the central word weight (used when the word i is the central word), N is the total number of text sentences in the search target text sentence set, and ni is the search target text The total number of text sentences including the word i in the sentence set, c is a constant, and c ≧ 0.01.
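The corrected feature quantities in the claims above fall into two families: the weights are either added to TFi before the logarithmic factor is applied, or multiplied onto the product afterwards. The sketch below contrasts the two combined forms, (TFi + Chunk_Weight + Head_Weight) × log (N / ni + c) and (TFi × log (N / ni + c)) × (Chunk_Weight + Head_Weight) / 2. All function and parameter names are hypothetical, the natural logarithm is assumed, and a Head_Weight of 0 is assumed for a word that is not a central word.

```python
import math


def idf(n_total, n_with_word, c=0.01):
    """log(N / ni + c); the natural logarithm is an assumption."""
    return math.log(n_total / n_with_word + c)


def additive_weight(tf, n_total, n_with_word,
                    chunk_weight=0.0, head_weight=0.0, c=0.01):
    """(TFi + Chunk_Weight + Head_Weight) x log(N / ni + c)."""
    return (tf + chunk_weight + head_weight) * idf(n_total, n_with_word, c)


def multiplicative_weight(tf, n_total, n_with_word,
                          chunk_weight, head_weight, c=0.01):
    """(TFi x log(N / ni + c)) x (Chunk_Weight + Head_Weight) / 2."""
    return tf * idf(n_total, n_with_word, c) * (chunk_weight + head_weight) / 2


if __name__ == "__main__":
    # Illustrative values only; the weights are not taken from the figures.
    print(additive_weight(1, 1000, 10, chunk_weight=2.0, head_weight=1.0))
    print(multiplicative_weight(1, 1000, 10, chunk_weight=2.0, head_weight=1.0))
```

Setting one of the two weights to 0 in `additive_weight` gives the single-weight additive forms; the single-weight multiplicative forms in the claims multiply by Chunk_Weight or Head_Weight directly, without the division by 2.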