JP2013222418A

JP2013222418A - Passage division method, device and program

Info

Publication number: JP2013222418A
Application number: JP2012095344A
Authority: JP
Inventors: Yasuki Kakishita; 容弓柿下; Hideharu Hattori; 英春服部; Tomokazu Murakami; 智一村上; Osamu Konichi; 修今一
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2012-04-19
Filing date: 2012-04-19
Publication date: 2013-10-28
Anticipated expiration: 2032-04-19
Also published as: JP5869948B2; CN103377187A; CN103377187B

Abstract

PROBLEM TO BE SOLVED: With conventional methods, it is difficult to divide passages correctly when a single document contains a plurality of passages that include sentences with similar meanings and similar feature quantities. SOLUTION: Under the control of a control unit 101, a passage division device 100 divides a document input from an input unit 102 into sentence units at a sentence division unit 103. A feature quantity calculation unit 104 uses each divided sentence as a query to perform associative retrieval over documents stored beforehand in a corpus unit 111, and acquires a document vector. A similarity calculation unit 105 finds the two document vectors whose similarity is maximal, and when that similarity is equal to or larger than a prescribed threshold, a retrieval query generation unit 106 generates a query from the common elements of the two vectors. The feature quantity calculation unit 104 regenerates a document vector by using this query. A feature quantity update unit 107 updates the feature quantity on the basis of its reliability, and the corresponding sentences are connected sequentially to form passages while the feature quantities are updated.

Description

The present invention relates to the processing of electronic documents, and more particularly to a passage division technique for electronic documents.

In recent years, as documents have increasingly been digitized and stored in databases, natural language processing technology has advanced considerably, and much research has been conducted on, for example, automatic document summarization and automatic keyword extraction for document retrieval. However, these technologies often assume that the target document has already been divided into passages, that is, into units that cohere by topic, content, or meaning, or that it contains only a single passage. For a document that includes a plurality of passages, it is therefore effective to divide it into passages in advance. Text segmentation methods such as those described in Patent Literature 1 and Patent Literature 2 are conventionally known as passage division methods of this kind.

Patent Literature 1: JP 2009-15795 A
Patent Literature 2: JP 2004-145790 A

However, with conventional passage division and text segmentation methods, it is difficult to divide passages correctly when a single document contains a plurality of passages that include sentences with similar meanings, that is, sentences whose feature amounts are similar. As a result, automatic document summarization, automatic keyword extraction for document retrieval, and similar processing cannot be carried out efficiently.

The present invention has been made in view of the above problem, and its object is to provide a passage dividing method, apparatus, and program that effectively divide a document including a plurality of passages.

In order to achieve the above object, the present invention provides a passage dividing method in which a processing unit divides a document into passages. The processing unit divides the document into sentence units, uses each divided sentence as a query to extract related documents from a plurality of documents stored in advance and creates feature amounts, and, when the similarity between two of the created feature amounts is equal to or greater than a predetermined threshold, updates the feature amount using the common elements of those two feature amounts.

Also in order to achieve the above object, the present invention provides a passage dividing apparatus that divides an input document into passages and comprises a processing unit and a storage unit. The processing unit divides the document into sentence units, uses each divided and stored sentence as a query to extract related documents from a plurality of documents stored in advance in the storage unit and creates feature amounts, and, when the similarity between two of the created feature amounts is equal to or greater than a predetermined threshold, updates the feature amount using the common elements of those feature amounts.

Furthermore, in order to achieve the above object, the present invention provides a passage dividing program executed by the processing unit of a passage dividing apparatus that comprises a processing unit and a storage unit and divides an input document into passages. The program causes the processing unit to divide the document into sentence units, to use each divided sentence as a query to extract related documents from a plurality of documents stored in advance in the storage unit, to create feature amounts using the extracted related documents, and, when the similarity between two of the created feature amounts is equal to or greater than a predetermined threshold, to update the feature amount using the common elements of those feature amounts.

According to the present invention, passages can be divided correctly even when a single document contains a plurality of passages that include sentences with similar meanings, that is, sentences whose feature amounts are similar.

FIG. 1A is a diagram showing an example of the functional configuration of the passage dividing apparatus of the first embodiment.
FIG. 1B is a diagram showing an example of the hardware configuration of the passage dividing apparatus of the first embodiment.
FIG. 2 is a diagram showing an example of the operation of the passage dividing program according to the first embodiment.
FIG. 3 is a diagram showing how sentences are connected according to the similarity of document vectors in the first embodiment.
FIG. 4 is a diagram showing an example of the functional configuration of the passage dividing apparatus of the second embodiment.
FIG. 5 is a diagram showing an example of the operation of the passage dividing program according to the second embodiment.
FIG. 6 is a diagram for explaining an example of a document vector according to each embodiment.
FIG. 7 is a diagram for explaining an example of a word vector according to each embodiment.

Embodiments of the present invention are described below with reference to the drawings, but the present invention is not limited to these embodiments. In this specification, the two Japanese terms 文書 and ドキュメント, both rendered here as "document", are synonymous. A "passage" means a unit that coheres by topic, content, or meaning. A document vector means a vector whose dimensions are the stored documents, and a word vector means a vector whose dimensions are all of the words appearing in all of the documents. The "feature amount" of a sentence is a quantitative representation of the meaning of the sentence; the document vector and the word vector are described as examples of it.

The first embodiment is an embodiment of a passage dividing method, apparatus, and program that use document vectors for similarity calculation and word vectors for similar-document retrieval. In this embodiment, a document vector is a vector whose dimensions are all of the documents contained in the corpus unit of the dividing apparatus.

Before the details of this embodiment are described, examples of a document vector and a word vector are given.
FIG. 6 shows an example of a document vector. In FIG. 6, the total number of documents contained in the corpus unit is taken to be 10. When the documents obtained by a search are documents 1, 3, 4, and 8, the document vector can be expressed as document vector 601 shown in (a) of FIG. 6. Similarly, when the search yields search scores, those scores can be used to express the result as document vector 602 shown in (b) of FIG. 6.
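The two forms of document vector described for FIG. 6 can be sketched as follows. This is a minimal illustration, assuming documents are numbered from 1 to 10; the function name `make_document_vector` is hypothetical and the score values are invented, since the figure's actual scores are not given in the text.

```python
def make_document_vector(num_documents, hits, scores=None):
    """Build a document vector with one dimension per corpus document.

    hits   -- 1-based numbers of the documents returned by the search
    scores -- optional dict {document number: search score}; when omitted,
              retrieved documents get 1 and all others 0 (binary form)
    """
    vector = [0.0] * num_documents
    for doc in hits:
        vector[doc - 1] = 1.0 if scores is None else float(scores[doc])
    return vector


# Binary form, in the style of document vector 601 (documents 1, 3, 4, 8).
binary = make_document_vector(10, [1, 3, 4, 8])

# Score-weighted form, in the style of document vector 602.
# These scores are made up for illustration only.
example_scores = {1: 0.9, 3: 0.4, 4: 0.7, 8: 0.2}
weighted = make_document_vector(10, [1, 3, 4, 8], example_scores)
```

Both vectors have the same non-zero positions; only the values stored at those positions differ.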

FIG. 7 shows an example of a word vector. A word vector is a vector whose dimensions are all of the words appearing in all of the documents; in the word vector of FIG. 7, the number of distinct words appearing in all of the documents is taken to be 10. When the words contained in a given document are words 3, 6, 7, and 8, with occurrence frequencies of 1, 5, 3, and 9 respectively, substituting each frequency into the corresponding element yields word vector 701 shown in FIG. 7.
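The word vector of this example can be reproduced with a short sketch. The helper name `make_word_vector` is hypothetical, and words are assumed to be numbered from 1 to 10 as in the description.

```python
def make_word_vector(vocabulary_size, frequencies):
    """Build a word vector with one dimension per distinct word.

    frequencies -- dict {1-based word number: occurrence frequency}
    """
    vector = [0] * vocabulary_size
    for word, count in frequencies.items():
        vector[word - 1] = count
    return vector


# Words 3, 6, 7, 8 occur 1, 5, 3, 9 times respectively.
word_vector = make_word_vector(10, {3: 1, 6: 5, 7: 3, 8: 9})
```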

FIG. 1A is a diagram showing an example of the functional blocks of the passage dividing apparatus according to the first embodiment. FIG. 1B is a diagram showing an example of a hardware configuration that realizes the passage dividing apparatus of the first embodiment. The hardware configuration of FIG. 1B shows a computer comprising a central processing unit (CPU) 11, which is an ordinary processing unit; a storage unit 12 such as memory, RAM, ROM, a hard disk drive (HDD), or another storage device; an input/output unit 13; and a communication unit 14, which is a network interface, these blocks being interconnected by an internal bus 15.

In FIG. 1A, the passage dividing apparatus 100 has a control unit 101, an input unit 102, a sentence dividing unit 103, a feature amount calculation unit 104, a similarity calculation unit 105, a search query generation unit 106, a feature amount update unit 107, a passage update unit 108, an output unit 109, a sentence storage unit 110, a corpus unit 111, a feature amount storage unit 112, a passage storage unit 113, and a morphological analysis unit 114. As a premise, the corpus unit 111 is assumed to store S_D documents, for example newspaper articles.

Of these, the input unit 102 and the output unit 109 correspond to the input/output unit 13 and the communication unit 14, and the sentence storage unit 110, the corpus unit 111, the feature amount storage unit 112, and the passage storage unit 113 correspond to the memory and storage devices of the storage unit 12. The remaining units, namely the control unit 101, the sentence dividing unit 103, the feature amount calculation unit 104, the similarity calculation unit 105, the search query generation unit 106, the feature amount update unit 107, the passage update unit 108, and the morphological analysis unit 114, can be realized by the CPU 11 executing the operating system (OS) and various programs stored in a storage unit such as the ROM.

The operation of each functional block of the passage dividing apparatus of the first embodiment shown in FIG. 1A is described in order below.
First, the document to be divided into passages is input to the apparatus from the input unit 102. The sentence dividing unit 103, through execution of a predetermined program by the CPU 11 serving as the processing unit, divides the input document into sentence units and stores the resulting sentences in the sentence storage unit 110.

Similarly, the feature amount calculation unit 104 uses each sentence read from the sentence storage unit 110 to acquire related documents from the corpus unit 111, converts the resulting set of related documents into a document vector, and stores it in the feature amount storage unit 112. That is, the feature amount calculation unit 104 generates a document vector like the one illustrated in FIG. 6 by substituting values into the dimensions that correspond to the acquired related documents.

The search query generation unit 106 has the function of generating a search query and sending it to the control unit 101.

When a search query is given via the control unit 101, the feature amount calculation unit 104 acquires documents related to that search query from the sentence storage unit 110, converts the resulting set of related documents into a document vector, stores it as a feature amount in the feature amount storage unit 112, and also outputs it to the feature amount update unit 107 via the control unit 101.

The similarity calculation unit 105 has the function of reading two document vectors from the feature amount storage unit 112 as designated by the control unit 101 and calculating the similarity between them. The method of calculating similarity in this embodiment is described later. The similarity calculation unit 105 also determines whether the calculated similarity is equal to or greater than a predetermined threshold.

The search query generation unit 106 reads two document vectors from the feature amount storage unit 112 as designated by the control unit 101 and extracts from the corpus unit 111 the group of documents common to the two document vectors. It generates a search query from the extracted group of common documents and outputs the query to the control unit 101. The method of generating this search query is described later.

The feature amount update unit 107 reads two document vectors V_i and V_j from the feature amount storage unit 112 as designated by the control unit 101. In addition, one document vector V_k is input to the feature amount update unit 107 from the control unit 101. A reliability is calculated from the three document vectors V_k, V_i, and V_j, and V_k is corrected on the basis of that reliability. The reliability is described later. V_i and V_j are then deleted from the feature amount storage unit 112, and V_k is stored in the feature amount storage unit 112.

The passage update unit 108 reads two sentences or passage candidates from the sentence storage unit 110 or the passage storage unit 113 as designated by the control unit 101. It deletes the sentences or passage candidates it has read from the sentence storage unit 110 or the passage storage unit 113, connects them, and stores the connected result in the passage storage unit 113 as a passage candidate.

The output unit 109 reads sentences and passage candidates from the sentence storage unit 110 and the passage storage unit 113, respectively, determines whether each is an unknown passage, and outputs the passages with labels assigned on the basis of that determination. Here, an unknown passage is a sentence or passage candidate for which it could not be determined which passage it should be connected to. The method of determining unknown passages is described later.

FIG. 2 is a flowchart showing the operation of the passage dividing program executed by the passage dividing apparatus according to this embodiment. An example of the operation of the passage dividing program is described below with reference to FIG. 2.
Here, a case where a document containing two passages is input is described as an example. The input document may contain two or more passages; because the processing is the same in every case, a document containing two passages is used for the description.

The sentences contained in the first passage are defined as a_1, a_2, ..., a_N, and the sentences contained in the second passage as b_1, b_2, ..., b_M. Here, N is the number of sentences in the first passage (a natural number) and M is the number of sentences in the second passage (a natural number).

First, in step 201, a document is input from the input unit 102.
In step 202, the input document is divided into sentence units by the sentence dividing unit 103, and the sentences are stored in the sentence storage unit 110.

In step 203, all of the sentences a_1, a_2, ..., a_N, b_1, b_2, ..., b_M stored in the sentence storage unit 110 are input to the feature amount calculation unit 104, and document vectors are obtained as described above. One way of calculating the document vector is to use the cosine measure, which is one of the techniques used to measure the similarity between two vectors. The cosine measure of two vectors Q and P is calculated by Equation 1 below.
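The equation itself is not reproduced in this text. The standard definition of the cosine measure, which Equation 1 presumably denotes, is given here for reference; the component notation q_k, p_k and the dimension count n are introduced only for this illustration.

```latex
\cos(Q, P) = \frac{Q \cdot P}{\lVert Q \rVert \, \lVert P \rVert}
           = \frac{\sum_{k=1}^{n} q_k p_k}
                  {\sqrt{\sum_{k=1}^{n} q_k^{2}} \; \sqrt{\sum_{k=1}^{n} p_k^{2}}}
```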

In this embodiment, as described above, word vectors are used to retrieve similar documents. For example, for each document stored in the corpus unit 111, a word vector W_i (0 ≤ i < S_D) whose elements are the occurrence frequencies of the words it contains is created in advance. The input sentence is likewise converted into a word vector, denoted W_current. The cosine measure between W_current and each word vector W_i (0 ≤ i < S_D) is calculated, the L documents with the highest similarity (L being a predetermined natural number) are obtained, and the result is converted into a document vector and stored in the feature amount storage unit 112.
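The retrieval described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the patented implementation: word vectors are plain lists of frequencies, the corpus is tiny, and the function names `cosine` and `sentence_to_document_vector` are hypothetical. The binary form of the document vector is used.

```python
import math


def cosine(q, p):
    """Cosine measure of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(q, p))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in p))
    return dot / norm if norm else 0.0


def sentence_to_document_vector(w_current, corpus_word_vectors, top_l):
    """Mark the top_l corpus documents most similar to the sentence with 1."""
    sims = [cosine(w_current, w_i) for w_i in corpus_word_vectors]
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    vector = [0.0] * len(corpus_word_vectors)
    for i in ranked[:top_l]:
        vector[i] = 1.0
    return vector


# Toy corpus of S_D = 3 documents over a 4-word vocabulary.
corpus = [
    [2, 0, 1, 0],  # W_0
    [0, 3, 0, 1],  # W_1
    [1, 0, 2, 0],  # W_2
]
w_current = [1, 0, 1, 0]
doc_vector = sentence_to_document_vector(w_current, corpus, top_l=2)
```

Storing the similarity values instead of 1.0 would give the weighted form of the document vector.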

Although the cosine measure is used here as an example of similarity calculation, the similarity may be calculated using other measures. As for the value of each element of the document vector, as explained for (a) and (b) of FIG. 6, selected documents may be set to 1 and all other documents to 0, or some form of weighting may be applied, for example by using the calculated similarity.

Next, in step 204, the document vectors stored in the feature amount storage unit 112 are read out two at a time, and the similarity calculation unit 105 is used to find the pair of document vectors V_i, V_j with the highest similarity. The similarity in this case may be calculated with the cosine measure described above, or with another quantity such as the number of elements present in both document vectors, that is, the number of common elements.
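As a rough sketch of step 204, the search for the most similar pair can be written as an exhaustive comparison of all pairs. The number of common elements is used as the similarity here; the function names and the toy vectors are assumptions made for illustration.

```python
from itertools import combinations


def common_element_count(v_i, v_j):
    """Number of dimensions that are non-zero in both document vectors."""
    return sum(1 for a, b in zip(v_i, v_j) if a and b)


def most_similar_pair(vectors, similarity=common_element_count):
    """vectors: dict {label: document vector}.

    Returns (best similarity, label_i, label_j) over all pairs.
    """
    best = None
    for (i, v_i), (j, v_j) in combinations(vectors.items(), 2):
        sim = similarity(v_i, v_j)
        if best is None or sim > best[0]:
            best = (sim, i, j)
    return best


vectors = {
    "a1": [1, 1, 0, 0, 1],
    "a2": [1, 1, 1, 0, 1],
    "b1": [0, 0, 1, 1, 0],
}
best_sim, label_i, label_j = most_similar_pair(vectors)
```

Any other similarity function with the same two-argument shape, such as a cosine measure, can be passed in place of `common_element_count`.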

In step 205, the similarity calculation unit 105 determines whether the maximum similarity calculated in step 204 is equal to or greater than a preset threshold. The threshold may be a fixed value set in advance, or the mean or variance of the similarities may be computed when the similarities are calculated in step 204 and used to set it.
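The text does not say how the mean or variance would be turned into a threshold. One possible reading, shown purely as an assumption, is to place the threshold a fixed number of standard deviations above the mean; the function name `adaptive_threshold` and the parameter `k` are hypothetical.

```python
import statistics


def adaptive_threshold(similarities, k=1.0):
    """Threshold = mean + k * population standard deviation (an assumed rule)."""
    mean = statistics.mean(similarities)
    spread = statistics.pstdev(similarities)
    return mean + k * spread


# Similarities of all pairs computed in step 204 (made-up values).
pair_similarities = [40, 10, 5, 5]
threshold = adaptive_threshold(pair_similarities)
```

With this rule only pairs that stand out clearly from the rest of the distribution would be merged; a smaller `k` makes merging more permissive.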

Steps 206 and 207 are performed by the search query generation unit 106. In step 206, when the maximum similarity calculated in step 204 is equal to or greater than the threshold, the common elements of the pair of document vectors V_i, V_j are extracted and taken as the common elements V_ij of the document vectors.
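Step 206 amounts to an element-wise intersection of two document vectors. In the sketch below, a dimension is kept only when both vectors are non-zero there; keeping the smaller of the two values for weighted vectors is an assumption of this illustration, as are the function names.

```python
def common_elements(v_i, v_j):
    """Return V_ij: non-zero only where both V_i and V_j are non-zero.

    For weighted vectors the smaller value is kept (an assumed convention).
    """
    return [min(a, b) if a and b else 0 for a, b in zip(v_i, v_j)]


def common_document_ids(v_i, v_j):
    """0-based indices of the corpus documents shared by both vectors."""
    return [k for k, (a, b) in enumerate(zip(v_i, v_j)) if a and b]


v_i = [1, 0, 1, 1, 0]
v_j = [1, 1, 0, 1, 0]
v_ij = common_elements(v_i, v_j)
shared = common_document_ids(v_i, v_j)
```

The indices in `shared` identify the group of common documents from which the search query of step 207 is built.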

In step 207, a search query is generated from the common elements V_ij obtained in step 206. One way of generating the search query is, for example, to use TFIDF, which is a kind of weight assigned to words. TF (Term Frequency) and IDF (Inverse Document Frequency) are each given by the following equations, and TFIDF is obtained as the product of TF and IDF.
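The equations are not reproduced in this text. The conventional definitions that match the symbols used in this description are shown here as an assumption; in particular, the denominator of TF (the sum of n_k over all words k in document d) and the base of the logarithm are taken from the standard formulation rather than from the source.

```latex
\mathrm{TF}_i = \frac{n_i}{\sum_{k} n_k}, \qquad
\mathrm{IDF}_i = \log \frac{|D|}{|\{ d : t_i \in d \}|}, \qquad
\mathrm{TFIDF}_i = \mathrm{TF}_i \cdot \mathrm{IDF}_i
```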

Here, n_i is the number of occurrences of word i in document d, |D| is the total number of documents, and |{d : t_i ∈ d}| is the number of documents containing word t_i. In this embodiment, the total number of documents |D| corresponds to the number of all documents stored in the corpus unit 111.

Morphological analysis is performed on document d using the morphological analysis unit 114, the S_W words with the largest TFIDF values are extracted, and these are used as the search query. Methods other than TFIDF may also be used: for example, importance may be determined by frequency of occurrence, the title of the document may be used as the query, or the search query may be generated by some other method.
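A minimal sketch of this query generation follows. It assumes the conventional TF and IDF definitions with a natural logarithm, and it takes already tokenized word lists as input in place of real morphological analysis; the function name `tfidf_query` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_query(target_words, corpus_docs, s_w):
    """Return the s_w words of document d with the largest TF-IDF weight.

    target_words -- list of words of document d (already tokenized)
    corpus_docs  -- list of word lists, one per corpus document
    """
    counts = Counter(target_words)
    total = sum(counts.values())
    num_docs = len(corpus_docs)
    doc_sets = [set(doc) for doc in corpus_docs]
    scores = {}
    for word, n_i in counts.items():
        df = sum(1 for words in doc_sets if word in words)
        if df == 0:
            continue  # word absent from the corpus: IDF undefined, skip it
        scores[word] = (n_i / total) * math.log(num_docs / df)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:s_w]


corpus_docs = [
    ["stock", "market", "price", "rise"],
    ["market", "price", "fall"],
    ["soccer", "match", "goal"],
    ["stock", "price", "market", "stock"],
]
query = tfidf_query(corpus_docs[3], corpus_docs, s_w=2)
```

Here "stock" ranks first because it is both frequent in the target document and rarer across the corpus than "market" or "price".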

In step 208, the search query generated in step 207 is input to the feature amount calculation unit 104 via the control unit 101, and the feature amount calculation unit 104 obtains a new document vector V'_ij.

Next, steps 209 and 210 are executed; these perform, among other things, the calculation of the reliability of the newly obtained document vector V'_ij. Steps 209 and 210 are executed by the feature amount update unit 107 shown in FIG. 1A. First, in step 209, the reliability of the document vector V'_ij obtained in step 208 is calculated, and the vector size of the document vector is corrected according to the result.

In this embodiment, the reliability is an index that quantifies how many of the elements of the common elements V_ij are contained in the document vector V'_ij. One way to calculate it is, for example, to count how many elements of V_ij, the common elements of the pair of document vectors V_i, V_j, are contained in V'_ij and to divide that count by the number of elements of V_ij. Alternatively, when the elements of V_ij are weighted by importance, the reliability may be calculated according to the weighted importance. In either case, when the reliability is lower than a predetermined value, the reliability is fed back, for example by increasing or decreasing the vector size of the obtained document vector V'_ij.
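The counting form of the reliability can be illustrated as follows. The sketch treats any non-zero element as "contained"; the function name `reliability` and the sample vectors are assumptions.

```python
def reliability(v_new, v_common):
    """Fraction of the non-zero elements of V_ij that are also non-zero in V'_ij."""
    common_ids = [k for k, value in enumerate(v_common) if value]
    if not common_ids:
        return 0.0
    kept = sum(1 for k in common_ids if v_new[k])
    return kept / len(common_ids)


v_common = [1, 0, 1, 1, 0, 0]  # V_ij: documents 0, 2, 3 are common
v_new = [1, 1, 0, 1, 0, 1]     # V'_ij obtained from the regenerated query
r = reliability(v_new, v_common)
```

In this example two of the three common documents reappear in V'_ij, so the reliability is 2/3; a value below the chosen cutoff would trigger the feedback described above.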

In step 210, the document vectors V_i and V_j from which the common elements V_ij were generated are deleted from the feature amount storage unit 112, and the newly obtained document vector V'_ij is stored in the feature amount storage unit 112.

In step 211, for the passage dividing method of this embodiment, the passage update unit 108 connects the two sentences or passage candidates corresponding to V_i and V_j. Sentences that have never been connected are stored in the sentence storage unit 110. When sentences are connected, the sentences as they were before connection are deleted from the sentence storage unit 110. When a passage candidate and a sentence are connected, or when two passage candidates are connected, not only are the pre-connection sentences deleted, but the pre-connection passage candidates are also deleted from the passage storage unit 113. The connected sentences or passage candidates are stored in the passage storage unit 113 as a new passage candidate.

In the passage dividing method and apparatus of this embodiment, the desired passages are created by repeating steps 204 to 211 in the flow of FIG. 2. When, in step 205, the maximum similarity between two document vectors is less than the predetermined threshold, step 212 is executed to end the creation of passages.
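The repetition of steps 204 to 211 has the shape of an agglomerative merging loop, sketched below. This is a simplified illustration, not the full method: the regeneration of the vector by a new search (steps 207 to 209) is replaced by a pluggable `regenerate` function that, by default, simply keeps the common elements, and all names and sample vectors are hypothetical.

```python
from itertools import combinations


def count_common(v_i, v_j):
    return sum(1 for a, b in zip(v_i, v_j) if a and b)


def common_elements(v_i, v_j):
    return [1 if a and b else 0 for a, b in zip(v_i, v_j)]


def divide_into_passages(sentences, vectors, threshold,
                         similarity=count_common, regenerate=common_elements):
    """sentences: list of str; vectors: their document vectors, same order."""
    # Each candidate is (sentence indices, document vector).
    candidates = [([k], v) for k, v in enumerate(vectors)]
    while len(candidates) > 1:
        # Step 204: find the most similar pair of document vectors.
        pairs = list(combinations(range(len(candidates)), 2))
        i, j = max(pairs, key=lambda p: similarity(candidates[p[0]][1],
                                                   candidates[p[1]][1]))
        best = similarity(candidates[i][1], candidates[j][1])
        # Step 205: stop when even the best pair is below the threshold.
        if best < threshold:
            break
        # Steps 206-210 (simplified): replace V_i and V_j by one vector.
        merged_vector = regenerate(candidates[i][1], candidates[j][1])
        # Step 211: connect the two sentences or passage candidates.
        merged_ids = sorted(candidates[i][0] + candidates[j][0])
        candidates = [c for k, c in enumerate(candidates) if k not in (i, j)]
        candidates.append((merged_ids, merged_vector))
    return [[sentences[k] for k in ids] for ids, _ in candidates]


sentences = ["a1", "a2", "a3", "b1", "b2"]
vectors = [
    [1, 1, 1, 0, 0, 0, 0],  # a1
    [1, 1, 1, 0, 0, 0, 1],  # a2
    [1, 1, 0, 0, 0, 0, 1],  # a3
    [0, 0, 0, 1, 1, 1, 0],  # b1
    [0, 0, 0, 1, 1, 0, 1],  # b2
]
passages = divide_into_passages(sentences, vectors, threshold=2)
```

Sentences a2 and b2 share one retrieved document, yet they end up in different passages because each merge narrows the vector to what the merged sentences have in common.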

Step 212 is executed by the output unit 109 and is the step in which unknown passages are determined and the passages are output. One example of a method for determining an unknown passage is to examine the number of morphemes contained in a sentence or passage candidate. When a sentence or passage candidate contains few morphemes, its document vector may not be created properly, and connection may be difficult. Therefore, in step 212, when the number of morphemes contained in a remaining sentence or passage candidate is equal to or less than a certain threshold, the output unit 109 outputs it with the label of an unknown passage, and the processing flow ends.
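The labeling of step 212 can be sketched as below. Whitespace splitting stands in for morphological analysis, which is a simplifying assumption, and the label strings and the function name `label_passages` are hypothetical.

```python
def label_passages(candidates, min_morphemes, tokenize=str.split):
    """Label each remaining sentence or passage candidate.

    A candidate whose morpheme count is at or below min_morphemes is
    labeled "unknown"; the others are numbered in order.
    """
    labeled = []
    next_id = 1
    for text in candidates:
        if len(tokenize(text)) <= min_morphemes:
            labeled.append(("unknown", text))
        else:
            labeled.append((f"passage {next_id}", text))
            next_id += 1
    return labeled


remaining = [
    "the market rose sharply after the announcement",
    "I see",
]
labels = label_passages(remaining, min_morphemes=3)
```

A real morphological analyzer can be supplied through the `tokenize` parameter without changing the rest of the function.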

FIG. 3 schematically shows an example of how sentences are connected according to the similarity of document vectors in this embodiment. The threshold in step 205 of FIG. 2 is assumed to be 10.
The result of the first similarity calculation is result 301. The pair with the highest similarity in result 301 is a_2 and a_3, with a similarity of 40.

The processing from step 205 to step 211 of FIG. 2 is therefore performed on this pair, and the flow returns to step 204 of FIG. 2. The connected result is denoted a_23. Similarly, b_1 and b_2 are selected as the most similar pair in result 302, and a_1 and a_23 in result 303, and the processing from step 205 to step 211 of FIG. 2 is performed on each. Because the threshold is set to 10, no pair is selected in result 304, and the creation of passages is complete.

According to the first embodiment described in detail above, a plurality of passages can be divided correctly even when a single document contains a plurality of passages that include sentences with similar meanings, that is, sentences whose feature amounts are similar. This in turn benefits processing such as automatic document summarization and automatic keyword extraction for document retrieval.

The second embodiment is an embodiment of a passage dividing method, apparatus, and program that use word vectors both for similarity calculation and for similar document search.
FIG. 4 is a functional block diagram of the passage dividing apparatus according to the second embodiment. As with the apparatus of FIG. 1A in the first embodiment, the hardware configuration of this apparatus can of course be realized by the computer shown in FIG. 1B or the like, so illustration and description of the hardware configuration are omitted here.

The input unit 402, sentence division unit 403, passage update unit 408, output unit 409, sentence storage unit 410, feature amount storage unit 412, passage storage unit 413, and morpheme analysis unit 414 are the same as the corresponding blocks of the first embodiment. Therefore, only the corpus unit 411, feature amount calculation unit 404, similarity calculation unit 405, search query generation unit 406, and feature amount update unit 407, which differ from the first embodiment, are described below. Note that the morpheme analysis unit 414 is connected to the feature amount calculation unit 404.

The corpus unit 411 uses, for example, a collection of documents such as newspaper articles, a thesaurus, or both.

The feature amount calculation unit 404 performs morphological analysis on a sentence read from the sentence storage unit 410 using the morpheme analysis unit 414, and converts the sentence into a word vector. When the word vector does not have enough elements, it is effective to increase the number of elements using the corpus unit 411. For example, when a thesaurus is used as the corpus, synonyms are searched for using each word obtained from the input sentence as a query, and the resulting synonyms are added to the word vector. When a collection of documents is used as the corpus, word vectors extracted from the documents in the corpus can be added to the word vector obtained from the input sentence.
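The thesaurus-based expansion described above might look like the following Python sketch. The thesaurus is assumed to be a plain dictionary from a word to its synonyms, each added synonym is given a weight of 1, and the name `expand_with_thesaurus` is hypothetical; the specification does not state how added elements are weighted.

```python
from collections import Counter


def expand_with_thesaurus(words, thesaurus):
    """Build a word vector from a sentence's words and add their synonyms.

    words: words obtained from the input sentence by morphological analysis.
    thesaurus: dict mapping a word to a list of synonyms (an assumption).
    """
    vector = Counter(words)  # word -> count in the sentence
    for word in words:
        # Use each word as a query; unknown words yield no synonyms.
        for synonym in thesaurus.get(word, ()):
            vector[synonym] += 1  # weight of 1 per synonym is an assumption
    return vector


toy_thesaurus = {"car": ["automobile", "vehicle"]}
expanded = expand_with_thesaurus(["car", "fast"], toy_thesaurus)
```

Here `expanded` contains the two original words plus the two synonyms of "car", so a short sentence yields a vector with more elements to compare.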

Another way to add elements to the word vector is to extract important words from the top several documents using TF-IDF or the like and add them to the word vector. The method is not limited to these; words related to the sentence may be obtained and added by other methods so that the word vector has enough elements. The resulting word vector is then stored in the feature amount storage unit 412. When a word vector is supplied from the search query generation unit 406 to the feature amount calculation unit 404 via the control unit 401, the number of elements of that word vector is expanded in the same way, and the word vector is stored in the feature amount storage unit 412 and also output to the feature amount update unit 407 via the control unit 401.
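One possible reading of the TF-IDF variant is sketched below in Python. It assumes that the top documents have already been retrieved and belong to the corpus, and it uses a simple `tf * log(N / df)` score; the function name `important_words`, the parameter `k`, and the tie-breaking rule are hypothetical choices, not details from the specification.

```python
import math
from collections import Counter


def important_words(top_docs, corpus, k=2):
    """Return the k highest-scoring words of the top documents by TF-IDF.

    top_docs: tokenized top-ranked documents (assumed to be part of corpus).
    corpus: all tokenized documents, used for document frequencies.
    """
    n = len(corpus)
    # Document frequency: number of corpus documents containing each word.
    df = Counter(w for doc in corpus for w in set(doc))
    # Term frequency accumulated over the top documents.
    tf = Counter(w for doc in top_docs for w in doc)
    scores = {w: tf[w] * math.log(n / df[w]) for w in tf}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


toy_corpus = [
    ["stock", "market", "the"],
    ["stock", "price", "the"],
    ["soccer", "match", "the"],
]
extra = important_words(toy_corpus[:2], toy_corpus, k=2)
```

Words that occur in every document, such as "the" in the toy corpus, score zero and are therefore never selected ahead of more distinctive words.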

The similarity calculation unit 405 of this embodiment reads two word vectors from the feature amount storage unit 412 as designated by the control unit 401, and calculates the similarity between the two word vectors. One example of a similarity calculation method is the cosine measure described above.
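For reference, the cosine measure between two sparse word vectors can be computed as in the following Python sketch. Representing a word vector as a dictionary from word to weight is an assumption, as is the name `cosine_similarity` and the choice to return 0.0 for an empty vector.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse word vectors (dict word -> weight)."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


same = cosine_similarity({"a": 1, "b": 1}, {"a": 2, "b": 2})
disjoint = cosine_similarity({"a": 1}, {"b": 1})
```

Vectors pointing in the same direction score 1 (up to rounding) regardless of length, while vectors with no shared words score 0.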

The search query generation unit 406 of this embodiment reads two word vectors from the feature amount storage unit 412 as designated by the control unit 401, and extracts from the corpus unit 411 the group of words common to the two word vectors. It creates a word vector from the extracted common words and outputs it to the feature amount calculation unit 404 via the control unit 401.
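A simplified Python sketch of the common-word extraction follows. It works directly on the two word vectors rather than consulting a corpus, and it keeps the smaller of the two weights for each shared word; both simplifications, and the name `common_word_vector`, are assumptions made for illustration.

```python
def common_word_vector(v_i, v_j):
    """Word vector built from the words shared by v_i and v_j.

    v_i, v_j: dicts mapping word -> weight.
    The smaller weight is kept for each shared word (an assumption).
    """
    shared = v_i.keys() & v_j.keys()
    return {w: min(v_i[w], v_j[w]) for w in shared}


query = common_word_vector({"stock": 2, "price": 1}, {"stock": 1, "soccer": 3})
```

The resulting vector would then be handed to the feature amount calculation step, which expands it again as described for word vectors supplied by the search query generation unit.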

The feature amount update unit 407 reads two word vectors V_i and V_j from the feature amount storage unit 412 as designated by the control unit 401. One word vector V_k is also input from the control unit 401. The unit calculates a reliability from the three word vectors V_k, V_i, and V_j, and corrects the vector size of V_k based on the reliability. It then deletes V_i and V_j from the feature amount storage unit 412 and stores V_k in the feature amount storage unit 412.
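The update performed by the feature amount update unit 407 can be sketched in Python as follows. The reliability formula used here, the fraction of the words common to V_i and V_j that also appear in V_k, is an assumption loosely based on the wording of the claims ("the degree to which" the new vector contains the common elements); the store layout and the name `update_feature` are likewise hypothetical.

```python
def update_feature(store, key_i, key_j, key_k, v_k):
    """Scale v_k by a reliability value, then replace V_i and V_j with it.

    store: dict standing in for the feature amount storage unit.
    v_k: regenerated word vector (dict word -> weight).
    Returns the reliability that was applied.
    """
    v_i, v_j = store[key_i], store[key_j]
    common = v_i.keys() & v_j.keys()
    # Assumed reliability: share of common words that v_k still contains.
    reliability = len(common & v_k.keys()) / len(common) if common else 0.0
    scaled = {w: x * reliability for w, x in v_k.items()}
    del store[key_i], store[key_j]  # V_i and V_j are removed
    store[key_k] = scaled           # the corrected V_k is stored
    return reliability


store = {"s1": {"a": 1, "b": 2}, "s2": {"a": 3, "b": 1, "c": 1}}
r = update_feature(store, "s1", "s2", "s1+s2", {"a": 2.0, "d": 4.0})
```

In this toy run only one of the two common words survives in V_k, so the vector is halved before it replaces the two original entries.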

FIG. 5 is a processing flowchart showing the operation of the program according to the second embodiment. The first embodiment uses document vectors for the similarity calculation, whereas the second embodiment uses word vectors as described above. Apart from this difference, the operation is the same as in the first embodiment.

According to the second embodiment, even when a single document contains a plurality of passages that include sentences with similar meanings, that is, sentences with similar feature amounts, the passages can be divided correctly.

The present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments have been described in detail to explain the present invention clearly, and the invention is not necessarily limited to configurations that include every element described. The configuration of one embodiment may be added to the configuration of another embodiment. In addition, for part of the configuration of each embodiment, other configurations may be added, deleted, or substituted.

Some or all of the configurations, functions, processing units, processing means, and the like described above may be realized in hardware, for example by designing them as an integrated circuit. The configurations and functions above have been described for the case where they are realized in software by executing programs that implement the respective functions. Information such as the programs, tables, and files that implement each function can be placed not only in memory but also in a storage device such as a hard disk or SSD (Solid State Drive), or on a recording medium such as an IC card, SD card, or DVD, and can also be downloaded and installed via a network or the like as necessary.

11 CPU
12 storage unit
13 input/output unit
14 communication unit
100, 400 passage dividing apparatus
101, 401 control unit
102, 402 input unit
103, 403 sentence division unit
104, 404 feature amount calculation unit
105, 405 similarity calculation unit
106, 406 search query generation unit
107, 407 feature amount update unit
108, 408 passage update unit
109, 409 output unit
110, 410 sentence storage unit
111, 411 corpus unit
112, 412 feature amount storage unit
113, 413 passage storage unit
114, 414 morpheme analysis unit

Claims

A passage dividing method for dividing a document into passages by a processing unit,
wherein the processing unit:
divides the document into sentence units;
extracts related documents from a plurality of previously stored documents using each divided sentence as a query, and creates feature amounts; and
when the similarity between two of the created feature amounts is equal to or greater than a predetermined threshold, updates the feature amount using a common element of the two feature amounts.

The passage dividing method according to claim 1,
wherein the processing unit uses a document vector as the feature amount.

The passage dividing method according to claim 2,
wherein, when the similarity between two document vectors V_i and V_j, which are the two feature amounts, is equal to or greater than the predetermined threshold, the processing unit extracts a common element V_ij of the two document vectors V_i and V_j and generates a search query.

The passage dividing method according to claim 3,
wherein the processing unit obtains a new document vector V'_ij using the generated search query.

The passage dividing method according to claim 4,
wherein the processing unit corrects the vector size of the new document vector V'_ij according to the degree to which the new document vector V'_ij contains the elements of the common element V_ij.

The passage dividing method according to claim 4,
wherein the processing unit concatenates the sentences or passage candidates corresponding to the new document vector V'_ij into a new passage candidate.

The passage dividing method according to claim 1,
wherein the processing unit uses a word vector as the feature amount.

The passage dividing method according to claim 7,
wherein, when the similarity between two word vectors V_i and V_j, which are the two feature amounts, is equal to or greater than the predetermined threshold, a common element V_ij of the two word vectors V_i and V_j is extracted and a search query is generated, and
a new word vector V'_ij is obtained using the generated search query.

The passage dividing method according to claim 8,
wherein the processing unit corrects the vector size of the new word vector V'_ij according to the degree to which the new word vector V'_ij contains the elements of the common element V_ij.

The passage dividing method according to claim 9,
wherein the processing unit concatenates the sentences or passage candidates corresponding to the new word vector V'_ij into a new passage candidate.

A passage dividing apparatus for dividing an input document into passages, comprising
a processing unit and a storage unit,
wherein the processing unit:
divides the document into sentence units;
extracts related documents from a plurality of documents stored in the storage unit in advance using each divided sentence as a query, and creates feature amounts; and
when the similarity between two of the created feature amounts is equal to or greater than a predetermined threshold, updates the feature amount using a common element of the two feature amounts.

The passage dividing apparatus according to claim 11,
wherein the processing unit uses, as the feature amount, a document vector or a word vector based on the related documents.

The passage dividing apparatus according to claim 12,
wherein the processing unit:
when the similarity between two document vectors or word vectors V_i and V_j, which are the two feature amounts, is equal to or greater than the predetermined threshold, extracts a common element V_ij of the two document vectors or word vectors V_i and V_j and generates a search query;
obtains a new document vector or word vector V'_ij using the generated search query; and
corrects the vector size of the new document vector or word vector V'_ij according to the degree to which the new document vector or word vector V'_ij contains the elements of the common element V_ij.

The passage dividing apparatus according to claim 13,
wherein the processing unit concatenates the sentences or passage candidates corresponding to the new document vector V'_ij and stores the newly concatenated passage candidate in the storage unit.

A passage dividing program executed by a processing unit of a passage dividing apparatus that includes the processing unit and a storage unit and divides an input document into passages,
the program causing the processing unit to:
divide the document into sentence units;
extract related documents from a plurality of documents stored in the storage unit in advance, using each divided sentence as a query;
create feature amounts using the extracted related documents; and
when the similarity between two of the created feature amounts is equal to or greater than a predetermined threshold, update the feature amount using a common element of the two feature amounts.