JP4252038B2

JP4252038B2 - Paraphrase expression acquisition system, paraphrase expression acquisition method, and paraphrase expression acquisition program

Info

Publication number: JP4252038B2
Application number: JP2005002366A
Authority: JP
Inventors: Takaaki Hasegawa (長谷川隆明)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-01-07
Filing date: 2005-01-07
Publication date: 2009-04-08
Anticipated expiration: 2025-01-07
Also published as: JP2006190146A

Description

The present invention relates to a technique for collecting, from a document set consisting of a large number of documents, paraphrase expressions that state the same semantic content in different words, and in particular to a technique for acquiring paraphrase expressions that express a specific relationship existing between any two words or word strings.

Advances in hardware have made it possible to handle large document sets, and attempts have been proposed to acquire paraphrase expressions automatically from a document set without relying on manually constructed paraphrase knowledge.

As one method of automatically acquiring paraphrase expressions from a document set, it has been proposed to use two comparable corpora that report the same events on the same day, parse the sentences that have been aligned with each other, and extract paraphrase expressions using the key words for each event as clues (see Non-Patent Document 1).

Another proposed method parses the document set, obtains each verb together with its subject and object from the structure of each sentence, collects the subjects and objects of each verb over the whole document set, and computes the mutual information of subjects and objects between arbitrary verbs, thereby finding similar verbs and treating them as paraphrase expressions (see Non-Patent Document 2).

On the other hand, in order to acquire paraphrase expressions limited to a specific relationship, a method has also been proposed that uses known instances of the specific relationship expressed by the desired paraphrase expressions to collect expressions of that relationship from a document set (see Non-Patent Document 3). This method performs no parsing: it extracts expressions in which the specified instances appear in common across many documents, collects the sentences containing those expressions, and then selects the expressions in which those instances alone appear with high frequency, thereby acquiring paraphrase expressions.
Non-Patent Document 1: Satoshi Sekine, "Automatic Extraction of Paraphrase Expressions Using Multiple Newspapers", Proceedings of the Workshop of the 7th Annual Meeting of the Association for Natural Language Processing, 2001, pp. 9-14
Non-Patent Document 2: D. Lin and P. Pantel, "DIRT-Discovery of Inference Rules from Text", Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 323-328
Non-Patent Document 3: D. Ravichandran and E. Hovy, "Learning Surface Text Patterns for a Question Answering system", Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, 2002, pp. 41-47

The conventional methods that use parsing have the problem that, unless the parsing accuracy is sufficiently high, the accuracy of the subsequent paraphrase acquisition drops sharply. The method that does not use parsing has the problem that instances of the specific relationship expressed by the desired paraphrase expressions must be supplied in advance, and the result depends heavily on which instances are supplied.

To solve these problems, the object of the present invention is to make it possible to acquire paraphrase expressions without requiring parsing and without supplying instances of a specific relationship in advance. This is done by clustering on the sets of contexts, collected from the entire document set, in which any two co-occurring words or word strings having specific attributes appear, thereby discovering the specific relationship that holds between the co-occurring words or word strings, and then selecting only the contexts that express the discovered relationship.

To solve the above problems, the paraphrase expression acquisition system of the present invention comprises:
a document set database that stores a large number of documents in which words or word strings are tagged with specific attributes;
a co-occurrence word pair context database that stores the individual contexts of each co-occurrence word pair together with at least their frequencies;
a document frequency database that stores the document frequency of each word;
a context vector database that stores a context vector for each co-occurrence word pair;
a context vector similarity database that stores the similarities between context vectors;
a co-occurrence word pair cluster database that stores the co-occurrence word pairs of each cluster;
a document set input unit that inputs the document set from the document set database;
a co-occurrence word pair context collection unit that collects, from an input document, the contexts in which any two words or word strings tagged with specific attributes co-occur, does this for all documents, and stores the individual contexts, together with their frequencies, in the co-occurrence word pair context database for each co-occurrence word pair formed by a combination of two such words or word strings;
a document frequency calculation unit that splits each input document into words, counts for each word the number of documents containing it, does this for all documents, calculates for every word the document frequency, which is the ratio of the number of documents containing the word to the total number of documents, and stores it in the document frequency database;
a context vector generation unit that reads from the co-occurrence word pair context database the individual contexts of one co-occurrence word pair and their frequencies, splits each context into words, takes as the word frequency of each word the sum of the frequencies of the contexts containing that word, reads the document frequency of each word from the document frequency database, calculates the weight of each word from the two, generates a context vector consisting of the words that make up the contexts of that co-occurrence word pair and their weights, stores it in the context vector database, and does this for all co-occurrence word pairs;
a context vector similarity calculation unit that reads from the context vector database the context vectors of two co-occurrence word pairs, calculates the similarity between them, does this for every combination of two context vectors, and stores the results in the context vector similarity database;
a co-occurrence word pair clustering unit that reads all similarities between context vectors from the context vector similarity database, forms clusters of co-occurrence word pairs whose context vectors are similar, and stores the co-occurrence word pairs of each cluster in the co-occurrence word pair cluster database;
a relation label acquisition unit that reads the individual co-occurrence word pairs of one cluster from the co-occurrence word pair cluster database, reads from the context vector database the words that make up the contexts of those pairs, acquires the words shared among the contexts of many of the pairs as labels expressing the relationship of the pairs in that cluster, stores them in the corresponding cluster of the co-occurrence word pair cluster database, and does this for all clusters;
an intra-cluster context selection unit that acquires as paraphrase expressions, from the co-occurrence word pair cluster database and the co-occurrence word pair context database, the contexts shared by the individual co-occurrence word pairs of each cluster or the contexts containing the words that serve as labels expressing the relationship; and
an output unit that outputs the obtained clusters, the words serving as labels expressing the relationships in the clusters, and the paraphrase expressions.

According to the paraphrase expression acquisition system of the present invention, clusters of word pairs that share the same relationship can be obtained from a large corpus (document set) by clustering word pairs with similar contexts, without collecting pairs of documents that describe the same content and without supplying prior knowledge about relationships or paraphrase expressions. Paraphrase expressions can then be acquired by selecting, on the basis of the contexts and words shared within each cluster, only the contexts specific to the relationship of that cluster.

FIG. 1 shows an example of an embodiment of the paraphrase expression acquisition system of the present invention. In the figure, 1 is the document set database (document set DB), 2 is the co-occurrence word pair context database (co-occurrence word pair context DB), 3 is the document frequency database (document frequency DB), 4 is the context vector database (context vector DB), 5 is the context vector similarity database (context vector similarity DB), 6 is the co-occurrence word pair cluster database (co-occurrence word pair cluster DB), 11 is the document set input unit, 12 is the co-occurrence word pair context collection unit, 13 is the document frequency calculation unit, 14 is the context vector generation unit, 15 is the context vector similarity calculation unit, 16 is the co-occurrence word pair clustering unit, 17 is the relation label acquisition unit, 18 is the intra-cluster context selection unit, and 19 is the output unit.

The document set database 1 stores a large number of documents in which words or word strings are tagged with specific attributes. The co-occurrence word pair context database 2 stores the individual contexts of each co-occurrence word pair together with at least their frequencies. The document frequency database 3 stores the document frequency of each word. The context vector database 4 stores the context vector of each co-occurrence word pair. The context vector similarity database 5 stores the similarities between context vectors. The co-occurrence word pair cluster database 6 stores the co-occurrence word pairs of each cluster.

The document set input unit 11 inputs the document set from the document set database 1. The co-occurrence word pair context collection unit 12 collects, from an input document, the contexts in which any two words or word strings tagged with specific attributes co-occur, does this for all documents, and stores the individual contexts, together with their frequencies, in the co-occurrence word pair context database 2 for each co-occurrence word pair formed by a combination of two such words or word strings. The document frequency calculation unit 13 splits each input document into words, counts for each word the number of documents containing it, does this for all documents, calculates for every word the document frequency, which is the ratio of the number of documents containing the word to the total number of documents, and stores it in the document frequency database 3.

The context vector generation unit 14 reads from the co-occurrence word pair context database 2 the individual contexts of one co-occurrence word pair and their frequencies, splits each context into words, and takes as the word frequency of each word the sum of the frequencies of the contexts containing that word. It reads the document frequency of each word from the document frequency database 3, calculates the weight of each word from the two, generates a context vector consisting of the words that make up the contexts of that co-occurrence word pair and their weights, stores it in the context vector database 4, and does this for all co-occurrence word pairs.

The context vector similarity calculation unit 15 reads from the context vector database 4 the context vectors of two co-occurrence word pairs, calculates the similarity between them, does this for every combination of two context vectors, and stores the results in the context vector similarity database 5. The co-occurrence word pair clustering unit 16 reads all similarities between context vectors from the context vector similarity database 5, forms clusters of co-occurrence word pairs whose context vectors are similar (clustering), and stores the co-occurrence word pairs of each cluster in the co-occurrence word pair cluster database 6.

The relation label acquisition unit 17 reads the individual co-occurrence word pairs of one cluster from the co-occurrence word pair cluster database 6, reads from the context vector database 4 the words that make up the contexts of those pairs, acquires the words shared among many of the pairs as labels expressing the relationship of the pairs in that cluster, stores them in the corresponding cluster of the co-occurrence word pair cluster database 6, and does this for all clusters.

The intra-cluster context selection unit 18 acquires as paraphrase expressions, from the co-occurrence word pair cluster database 6 and the co-occurrence word pair context database 2, the contexts shared by the individual co-occurrence word pairs of each cluster or the contexts containing the words that serve as labels expressing the relationship. The output unit 19 outputs the obtained clusters, the words serving as labels expressing the relationships in the clusters, and the paraphrase expressions.

The paraphrase expression acquisition system described above can also be implemented by a computer (hardware) equipped with the above databases and a program (software) that cooperates with them to realize the various functions. FIG. 2 shows an example of the processing flow corresponding to this program.

The detailed configuration of the paraphrase expression acquisition system of the present invention is described below, together with its operation, using a concrete example.

Here it is assumed, for example, that the document set database 1 stores a large number of documents, forming a document set, in which proper nouns such as person names and place names are tagged as the specific attributes. The operation of acquiring, from this document set, the relationships between proper nouns and the paraphrase expressions that express them is described.

The document set input unit 11 sequentially retrieves the documents of the document set stored in the document set database 1 (s1).

The co-occurrence word pair context collection unit 12 detects, in each input document, the contexts (word strings) in which any two words or word strings tagged with two proper noun types specified in advance, for example person name and place name, or company name and company name, co-occur. It collects such contexts for each co-occurrence word pair over all documents and stores them in the co-occurrence word pair context database 2 together with the order of the co-occurring words and the frequency of each context (s2).

Co-occurrence here means appearing together in the same sentence. The additional condition may be imposed that the two co-occurring words lie within N words of each other (N is an integer) even within the same sentence, and the context may further be extended to include M words (M is an integer) outside the two words.

FIG. 3 shows an example of the co-occurrence word pairs and contexts stored in the co-occurrence word pair context database 2. The type of the co-occurrence word pairs is company name and company name; the pair Company A and Company B and the pair Company C and Company D are shown. For contexts of at most five words between the two co-occurring company names, the order of the co-occurring words and the frequency of the context are stored for each co-occurrence word pair. The order is expressed, for example, as 0 when Company A appears first and Company B second, and as 1 when Company B appears first and Company A second. When storing into the co-occurrence word pair context database 2, storage may be restricted to the co-occurrence word pairs whose frequency, that is, the sum of the frequencies of all contexts of the pair, exceeds a predetermined threshold.
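The collection step (s2) can be illustrated with a minimal Python sketch. It is not part of the patent text: the token representation (a list of (text, tag) tuples per sentence), the function name collect_contexts, and the choice to key each pair by its sorted names are assumptions made for illustration. Only the words between the two tagged names are collected, and the N-word limit is exposed as the hypothetical parameter max_gap.

```python
from collections import Counter, defaultdict

def collect_contexts(sentences, type_a, type_b, max_gap=5):
    """Collect the contexts between co-occurring tagged names.

    sentences: iterable of token lists; each token is (text, tag),
               where tag is None for ordinary words (assumed format).
    Returns {(name1, name2): Counter({(order, context): frequency})}.
    The pair key is sorted; order 0 means key[0] appeared first.
    """
    store = defaultdict(Counter)
    wanted = {type_a, type_b}
    for tokens in sentences:
        tagged = [(i, text, tag) for i, (text, tag) in enumerate(tokens) if tag]
        for a in range(len(tagged)):
            for b in range(a + 1, len(tagged)):
                i, first, tag1 = tagged[a]
                j, second, tag2 = tagged[b]
                if {tag1, tag2} != wanted or first == second:
                    continue  # wrong attribute types, or the same name twice
                if j - i - 1 > max_gap:
                    continue  # more than max_gap words between the two names
                context = " ".join(text for text, _ in tokens[i + 1:j])
                key = tuple(sorted((first, second)))
                order = 0 if key[0] == first else 1
                store[key][(order, context)] += 1
    return store
```

Calling collect_contexts(sentences, "COMPANY", "COMPANY") on sentences tagged this way yields one entry per company pair, with each context keyed by its order flag. The optional frequency threshold on pairs could be applied afterwards by summing each Counter.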

The document frequency calculation unit 13 splits each input document into words, counts for each word the number of documents containing it, does this for all documents, calculates for every word the document frequency, which is the ratio of the number of documents containing the word to the total number of documents, and stores it in the document frequency database 3 (s3). The document frequency df(w) of each word w is calculated by the following formula, although the calculation is not limited to it.

df(w) = log(Cw / N)
Here, Cw is the number of documents containing the word w, and N is the total number of documents in the document set.
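As a minimal sketch of step s3, the following Python function computes both the corpus-wide word frequency F(w) and df(w) = log(Cw / N) as defined above. The function name document_frequencies and the input format (each document already split into a list of words) are assumptions for illustration; the patent does not prescribe a tokenizer or a logarithm base, and the natural logarithm is used here.

```python
import math
from collections import Counter

def document_frequencies(documents):
    """documents: list of word lists (assumed to be tokenized already).

    Returns (F, df): F[w] is the total count of w in the document set,
    and df[w] = log(Cw / N), with Cw the number of documents containing w
    and N the total number of documents.
    """
    n_docs = len(documents)
    total = Counter()      # F(w)
    doc_count = Counter()  # Cw
    for words in documents:
        total.update(words)
        doc_count.update(set(words))  # count each word once per document
    df = {w: math.log(c / n_docs) for w, c in doc_count.items()}
    return total, df
```

With this definition df(w) is 0 for a word that occurs in every document and becomes more negative as the word becomes rarer.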

The document frequency database 3 consists of each word w, the frequency F(w) of the word w in the entire document set, and the calculated document frequency df(w).

The context vector generation unit 14 retrieves the set P of contexts for each co-occurrence word pair x stored in the co-occurrence word pair context database 2 and splits all of these contexts into words. For each word w, it computes the sum ΣC(Pi(w)) of the frequencies C(Pi(w)) of the contexts Pi(w) that contain w and takes this as the word frequency tf(w) of w. It then looks up the document frequency df(w) of w in the document frequency database 3, determines the weight Vx(w) of the word from tf(w) and df(w), generates for each co-occurrence word pair x a context vector Vx consisting of the words that make up its contexts and their weights, and stores it in the context vector database 4 (s4).

FIG. 4 shows an example of the context vectors stored in the context vector database 4. For each co-occurrence word pair, the words that make up the context vector and their weights are stored.

To remove words that are too general to be meaningful, words whose frequency F(w) in the entire document set, as stored in the document frequency database 3, is higher than a predetermined threshold may be excluded as stop words, or the words to exclude may be selected using part-of-speech information, such as prepositions and articles. Conversely, to exclude rare words whose frequency is too low, words whose frequency in the entire document set is lower than another predetermined threshold may also be excluded. Inflected words may be normalized to their base form, and only the past participle of a verb used in the passive voice may be kept distinct from other inflected forms such as the active past tense.

As an example of obtaining the word frequency, in FIG. 3 the word frequency of the context word buy for Company C :: Company D is 22, the sum of the frequencies 11, 8, and 3 of the contexts that contain buy. When counting the word frequency, the order of the co-occurring words may also be taken into account: if a word occurs L times when the order is 0 and R times when the order is 1, its word frequency may be set to L - R. This makes it possible to express the direction of the relationship held by the co-occurrence word pair. The weight of a context word is determined using tf*idf, the product of the word frequency tf(w) of the word w and the reciprocal of the document frequency df(w), although the weighting is not limited to this.
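The word frequency and weighting of step s4 might be sketched in Python as follows. The function names, the dictionary layout ({(order, context): frequency}), and whitespace tokenization are assumptions for illustration. One interpretation is made explicit: the text describes the weight as tf times the reciprocal of the document frequency, and this sketch takes that to mean the conventional inverse document frequency idf(w) = log(N / Cw) = -df(w). The patent does not state this formula, so it should be read as an assumption. Stop-word filtering and base-form normalization are omitted.

```python
from collections import Counter

def word_frequencies(contexts, directional=False):
    """contexts: {(order, context_string): frequency} for one word pair.

    tf(w) is the sum of the frequencies of the contexts containing w.
    With directional=True, contexts of order 1 count negatively (L - R).
    """
    tf = Counter()
    for (order, context), freq in contexts.items():
        sign = -1 if directional and order == 1 else 1
        for word in set(context.split()):  # one count per context
            tf[word] += sign * freq
    return tf

def context_vector(contexts, df, directional=False):
    """df: {word: log(Cw / N)}. The weight is tf(w) * idf(w), where
    idf(w) = -df(w) = log(N / Cw) is an assumed reading of the text."""
    tf = word_frequencies(contexts, directional)
    return {w: f * -df[w] for w, f in tf.items() if w in df}
```

For contexts containing buy with frequencies 11, 8, and 3, word_frequencies returns 22 for buy, matching the example above. Words missing from df are simply dropped from the vector in this sketch.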

The context vector similarity calculation unit 15 reads from the context vector database 4 the context vectors of two co-occurrence word pairs, calculates the similarity between them, and does this for every combination of two context vectors. The similarity Sim(α, β) of context vectors α and β is obtained by calculating the cosine cos(θ) of the angle θ between the two vectors from the following formula.

Sim(α, β) = cos(θ) = (α · β) / (|α| |β|)
In the example of FIG. 4 the words that make up the vectors are listed in different orders; needless to say, the inner product is calculated after aligning the two vectors on the same word order. A word that is present in one vector and absent from the other is given weight 0 in the vector from which it is absent. The calculated similarities of all combinations of context vectors are stored in the context vector similarity database 5 (s5).
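A minimal Python sketch of this similarity, assuming each context vector is held as a sparse dictionary {word: weight} (an assumed representation), is shown below. Looking words up by key aligns the two vectors regardless of their listing order, and a missing word contributes weight 0. Returning 0.0 for an all-zero vector is a convention chosen here, not something the text specifies.

```python
import math

def cosine_similarity(alpha, beta):
    """Cosine of the angle between two sparse vectors {word: weight}."""
    # Words absent from beta contribute 0 to the inner product.
    dot = sum(weight * beta.get(word, 0.0) for word, weight in alpha.items())
    norm_a = math.sqrt(sum(v * v for v in alpha.values()))
    norm_b = math.sqrt(sum(v * v for v in beta.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for empty or all-zero vectors
    return dot / (norm_a * norm_b)
```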

The co-occurrence word pair clustering unit 16 refers to all context vectors and the similarities between them stored in the context vector similarity database 5 and builds hierarchical clusters of similar context vectors bottom-up. Various clustering algorithms have been proposed, and no particular one is prescribed here. A threshold on the similarity is set in advance, and the clusters built at or above that threshold are stored in the co-occurrence word pair cluster database 6 (s6).
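Since no clustering algorithm is prescribed, the following Python sketch shows just one possible choice: naive bottom-up (agglomerative) clustering with average linkage, which stops merging once the best inter-cluster similarity falls below the threshold. The average-linkage criterion, the function name cluster_pairs, and the similarity table keyed by frozenset are all assumptions for illustration.

```python
def cluster_pairs(items, similarity, threshold):
    """Bottom-up clustering of word pairs.

    items: list of word-pair identifiers.
    similarity: {frozenset({a, b}): cosine similarity}; missing entries are 0.
    Merging stops when no two clusters have an average-link
    similarity of at least threshold.
    """
    clusters = [[item] for item in items]

    def linkage(c1, c2):
        sims = [similarity.get(frozenset((a, b)), 0.0) for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # merge the closest two clusters
        del clusters[j]
    return clusters
```

This naive loop recomputes all linkages on every merge, which is adequate for a sketch but would be slow on a large number of word pairs.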

The relation label acquisition unit 17 reads the individual co-occurrence word pairs of one cluster from the co-occurrence word pair cluster database 6 and reads from the context vector database 4 the words that make up the contexts of those pairs. It obtains the degree of overlap of the words shared among the contexts of many of the pairs, stores each such word, together with its degree of overlap, in the corresponding cluster of the co-occurrence word pair cluster database 6 as a label expressing the relationship of the pairs in that cluster, and does this for all clusters (s7).

FIG. 5 shows an example of the co-occurrence word pair cluster database 6. The co-occurrence word pair cluster database 6 consists of the cluster number, the co-occurrence word pairs of each cluster, and the words shared among the contexts of the co-occurrence word pairs of each cluster together with their degrees of overlap.

To obtain the degree of overlap of a word shared among context vectors, one may, for example, detect the shared words in every combination of two context vectors and compute, for each word, the proportion of all combinations in which it is present. For example, a cluster of five co-occurrence word pairs has 10 combinations of two pairs; if a word is shared by the context sets of four of the pairs, it is present in six of those combinations, so the proportion is 0.6. If a word is shared by all context vectors in the cluster, its degree of overlap in that cluster is 1.
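The overlap calculation described above can be sketched in Python as follows, assuming the context vectors of one cluster are given as {pair: {word: weight}} (an assumed layout; the function name overlap_degrees is also hypothetical). A word counts for a combination of two pairs when it appears in both of their context vectors.

```python
from itertools import combinations

def overlap_degrees(vectors):
    """vectors: {pair: {word: weight}} for the word pairs of one cluster.

    Returns {word: fraction of two-pair combinations whose context
    vectors both contain the word}.
    """
    combos = list(combinations(vectors, 2))
    if not combos:
        return {}  # a single-pair cluster has no combinations
    counts = {}
    for a, b in combos:
        for word in vectors[a].keys() & vectors[b].keys():
            counts[word] = counts.get(word, 0) + 1
    return {word: n / len(combos) for word, n in counts.items()}
```

For a cluster of five pairs in which one word occurs in four of the vectors, the function returns 6 / 10 = 0.6 for that word, as in the worked example.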

The intra-cluster context selection unit 18 (1) selects, from the co-occurrence word pairs of each cluster stored in the co-occurrence word pair cluster database 6 and the sets of contexts of the co-occurrence word pairs stored in the co-occurrence word pair context database 2, only the contexts shared by multiple co-occurrence word pairs in the cluster (s8).

For example, the context ", which is acquired by" in FIG. 3 is shared by the two co-occurrence word pairs Company A :: Company B and Company C :: Company D, so it is selected as an expression of the relationship of the cluster containing these two pairs.
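Selection method (1) can be sketched in Python as below. The layout {pair: {context: frequency}}, the function name shared_contexts, and the parameter min_pairs are assumptions; the order flag of each context is left out for brevity. The frequencies in the usage example are made up for illustration and are not taken from FIG. 3.

```python
from collections import Counter

def shared_contexts(cluster, context_db, min_pairs=2):
    """Return the contexts that occur for at least min_pairs word pairs
    of the cluster.

    cluster: list of word pairs; context_db: {pair: {context: frequency}}.
    """
    pair_count = Counter()
    for pair in cluster:
        pair_count.update(set(context_db[pair]))  # one count per pair
    return sorted(c for c, n in pair_count.items() if n >= min_pairs)

# Illustrative data only; the frequencies are invented.
example_db = {
    "Company A :: Company B": {", which is acquired by": 2, "is offering to buy": 4},
    "Company C :: Company D": {", which is acquired by": 1, "agreed to buy": 8},
}
print(shared_contexts(list(example_db), example_db))
```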

Alternatively, (2) from the co-occurrence word pairs of each cluster and the words common to the context vectors stored in the co-occurrence word pair cluster database 6, together with the sets of contexts of the co-occurrence word pairs stored in the co-occurrence word pair context database 2, the unit selects only the contexts containing a word that is common to many of the co-occurrence word pairs in the cluster, for example a word whose degree of overlap is at or above a predetermined threshold (s8).

For example, suppose that, based on FIG. 3 and FIG. 5, the condition is to select only the contexts containing a word whose degree of overlap is 0.5 or more among the common words of cluster 1 in FIG. 5. According to FIG. 5, only the word "buy" (excluding the past participle), whose degree of overlap is 1, satisfies this condition. Under this condition, only four of the contexts in FIG. 3 are selected, namely those containing "buy" (excluding the past participle): "is offering to buy" for Company A :: Company B, and "said it intends to buy", "agreed to buy", and "plans to buy" for Company C :: Company D.
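A sketch of selection (2) under the 0.5 threshold follows. The function name `select_contexts_by_label`, the simple regular-expression tokenizer, and the overlap value given for "acquired" are hypothetical; only the overlap degree of 1 for "buy" and the quoted contexts come from the description. Matching on exact tokens means that inflected forms such as a past participle are not treated as "buy".

```python
import re


def select_contexts_by_label(pair_contexts, overlap, threshold=0.5):
    """Keep only contexts containing a word whose overlap degree >= threshold."""
    labels = {word for word, degree in overlap.items() if degree >= threshold}
    selected = []
    for pair, contexts in pair_contexts.items():
        for context in contexts:
            # Naive tokenizer (an assumption): lowercase alphabetic runs only.
            tokens = set(re.findall(r"[a-z]+", context.lower()))
            if tokens & labels:
                selected.append((pair, context))
    return selected


cluster_contexts = {
    ("Company A", "Company B"): [
        "is offering to buy",
        ", which is acquired by",
    ],
    ("Company C", "Company D"): [
        "said it intends to buy",
        "agreed to buy",
        "plans to buy",
        ", which is acquired by",
    ],
}
# "buy" has overlap degree 1 per the description; 0.3 for "acquired" is a
# made-up value used only to show a word falling below the threshold.
overlap = {"buy": 1.0, "acquired": 0.3}

selected = select_contexts_by_label(cluster_contexts, overlap, threshold=0.5)
for pair, context in selected:
    print(" :: ".join(pair), "->", context)
```

Running the sketch lists the four "buy" contexts named in the text and leaves out ", which is acquired by".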

Cluster 1 in FIG. 5 can be regarded, from the words common to its contexts, as representing an M&A relationship, and the word "buy", which has the highest degree of overlap, can be regarded as a label for that relationship. Selecting only the contexts containing "buy" is equivalent to filtering out contexts that do not necessarily express the M&A relationship, and therefore leads to acquiring, with high precision, only paraphrase expressions that express the M&A relationship.

The logical OR of (1) and (2) above may also be used for context selection. In this case, the one context obtained by (1) and the four contexts obtained by (2) are acquired as paraphrase expressions representing the M&A relationship. The above is repeated for each cluster.
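The logical OR of the two selections amounts to an order-preserving union of their results, which might look like this. The helper name `union_of_selections` is hypothetical, and the two input lists simply restate the contexts that the description attributes to (1) and (2).

```python
def union_of_selections(*selections):
    """Merge several context selections, dropping duplicates, keeping order."""
    seen = set()
    merged = []
    for selection in selections:
        for context in selection:
            if context not in seen:
                seen.add(context)
                merged.append(context)
    return merged


by_shared_context = [", which is acquired by"]  # result of (1)
by_label_word = [                                # result of (2)
    "is offering to buy",
    "said it intends to buy",
    "agreed to buy",
    "plans to buy",
]
paraphrases = union_of_selections(by_shared_context, by_label_word)
print(len(paraphrases))  # 5
```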

The output unit 19 outputs and displays the clusters stored in the co-occurrence word pair cluster database 6 and the co-occurrence word pairs they contain, the words obtained by the relation label acquisition unit 17 as labels representing the relationship of each cluster, and the contexts obtained by the intra-cluster context selection unit 18 as paraphrase expressions for the relationship (s9).

FIG. 1 is a block diagram showing an example of an embodiment of the paraphrase expression acquisition system of the present invention.
FIG. 2 is a diagram showing an example of the flow of processing corresponding to the paraphrase expression acquisition program of the present invention.
FIG. 3 is a diagram showing an example of the co-occurrence word pair context database.
FIG. 4 is a diagram showing an example of the context vector database.
FIG. 5 is a diagram showing an example of the co-occurrence word pair cluster database.

Explanation of symbols

1: document set database (document set DB), 2: co-occurrence word pair context database (co-occurrence word pair context DB), 3: document frequency database (document frequency DB), 4: context vector database (context vector DB), 5: context vector similarity database (context vector similarity DB), 6: co-occurrence word pair cluster database (co-occurrence word pair cluster DB), 11: document set input unit, 12: co-occurrence word pair context collection unit, 13: document frequency calculation unit, 14: context vector generation unit, 15: context vector similarity calculation unit, 16: co-occurrence word pair clustering unit, 17: relation label acquisition unit, 18: intra-cluster context selection unit, 19: output unit.

Claims

A paraphrase expression acquisition system that acquires, from a document set, paraphrase expressions that express the same semantic content in different wording, the system comprising:
a document set database that stores a plurality of documents in which words or word strings are tagged with specific attributes;
a co-occurrence word pair context database that stores the individual contexts for each co-occurrence word pair;
a context vector database that stores a context vector for each co-occurrence word pair;
a context vector similarity database that stores the similarities between context vectors;
a co-occurrence word pair cluster database that stores the co-occurrence word pairs of each cluster;
a document set input unit that inputs a document set from the document set database;
a co-occurrence word pair context collection unit that collects, from an input document, the contexts in which any two words or word strings tagged with specific attributes appear together, does so for all documents, and stores the individual contexts in the co-occurrence word pair context database for each co-occurrence word pair consisting of a combination of two words or word strings;
a context vector generation unit that reads the individual contexts corresponding to one co-occurrence word pair from the co-occurrence word pair context database, generates a context vector composed of the words constituting the individual contexts corresponding to that co-occurrence word pair and their weights, stores it in the context vector database, and does so for all co-occurrence word pairs;
a context vector similarity calculation unit that reads the context vectors corresponding to two co-occurrence word pairs from the context vector database, calculates the similarity between them, does so for all combinations of context vectors corresponding to two co-occurrence word pairs, and stores the results in the context vector similarity database;
a co-occurrence word pair clustering unit that reads all similarities between context vectors from the context vector similarity database, forms clusters containing co-occurrence word pairs based on the similarities between the context vectors, and stores the co-occurrence word pairs included in each cluster in the co-occurrence word pair cluster database;
a relation label acquisition unit that reads the individual co-occurrence word pairs included in one cluster from the co-occurrence word pair cluster database, acquires, from among the words constituting the contexts corresponding to those co-occurrence word pairs, the words common to the contexts of the co-occurrence word pairs as labels representing the relationship between the individual co-occurrence word pairs included in that cluster, stores them in the corresponding cluster of the co-occurrence word pair cluster database, and does so for all clusters; and
an output unit that outputs the words obtained as labels representing the relationship in each cluster.

A paraphrase expression acquisition system that acquires, from a document set, paraphrase expressions that express the same semantic content in different wording, the system comprising:
a document set database that stores a plurality of documents in which words or word strings are tagged with specific attributes;
a co-occurrence word pair context database that stores the individual contexts for each co-occurrence word pair together with at least their frequencies;
a document frequency database that stores the document frequency of each word;
a context vector database that stores a context vector for each co-occurrence word pair;
a context vector similarity database that stores the similarities between context vectors;
a co-occurrence word pair cluster database that stores the co-occurrence word pairs of each cluster;
a document set input unit that inputs a document set from the document set database;
a co-occurrence word pair context collection unit that collects, from an input document, the contexts in which any two words or word strings tagged with specific attributes appear together, does so for all documents, and stores the individual contexts together with their frequencies in the co-occurrence word pair context database for each co-occurrence word pair consisting of a combination of two words or word strings;
a document frequency calculation unit that decomposes an input document into words, counts for each word the number of documents containing that word, does so for all documents, calculates for every word the document frequency, which is the ratio of the number of documents containing the word to the total number of documents, and stores it in the document frequency database;
a context vector generation unit that reads the individual contexts and their frequencies corresponding to one co-occurrence word pair from the co-occurrence word pair context database, splits each context into words, obtains for each word the sum of the frequencies of the contexts containing that word as the word frequency of the word, reads the document frequency of each word from the document frequency database, calculates the weight of each word from the two, generates a context vector composed of the words constituting the individual contexts corresponding to that co-occurrence word pair and their weights, stores it in the context vector database, and does so for all co-occurrence word pairs;
a context vector similarity calculation unit that reads the context vectors corresponding to two co-occurrence word pairs from the context vector database, calculates the similarity between them, does so for all combinations of context vectors corresponding to two co-occurrence word pairs, and stores the results in the context vector similarity database;
a co-occurrence word pair clustering unit that reads all similarities between context vectors from the context vector similarity database, forms clusters containing co-occurrence word pairs based on the similarities between the context vectors, and stores the co-occurrence word pairs included in each cluster in the co-occurrence word pair cluster database;
a relation label acquisition unit that reads the individual co-occurrence word pairs included in one cluster from the co-occurrence word pair cluster database, reads the words constituting the contexts corresponding to each of those co-occurrence word pairs from the context vector database, acquires the words common to the contexts of the co-occurrence word pairs as labels representing the relationship between the individual co-occurrence word pairs included in that cluster, stores them in the corresponding cluster of the co-occurrence word pair cluster database, and does so for all clusters; and
an output unit that outputs the words obtained as labels representing the relationship in each cluster.

The paraphrase expression acquisition system according to claim 1 or 2, further comprising an intra-cluster context selection unit that acquires, as paraphrase expressions, from the co-occurrence word pair cluster database and the co-occurrence word pair context database, the contexts common to the individual co-occurrence word pairs included in each cluster and the contexts containing the words serving as labels representing the relationship,
wherein the output unit further outputs the obtained paraphrase expressions.

A paraphrase expression acquisition program for causing a computer to function as each of the units constituting the paraphrase expression acquisition system according to any one of claims 1 to 3.