JP2013101679A

JP2013101679A - Text segmentation device, method, program, and computer-readable recording medium

Info

Publication number: JP2013101679A
Application number: JP2013015670A
Authority: JP
Inventors: Naoto Abe; 直人阿部; Toshiro Uchiyama; 俊郎内山; Masashi Uchiyama; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-01-30
Filing date: 2013-01-30
Publication date: 2013-05-23

Abstract

PROBLEM TO BE SOLVED: To realize text segmentation that uses web retrieval and requires no learning data. SOLUTION: A text segmentation device divides an input text into sentences and subjects each sentence to morphological analysis. It extracts all words except particles as retrieval words, converting inflected words to their end forms. It then subjects the text obtained by web retrieval on the basis of the retrieval words to morphological analysis and extracts all words except particles as related words, again converting inflected words to their end forms. Using a set of keywords that combines the retrieval words with the related words stored in related word storage means, the device determines semantic paragraphs on the basis of connectivity between the sentences, creates division candidates, evaluates the candidates, and selects and outputs one division result.

Description

The present invention relates to a text segmentation apparatus, method, program, and computer-readable recording medium, and in particular to a text segmentation apparatus, method, program, and computer-readable recording medium that, in fields where text is used on a computer, automatically divide a text according to the multiple contents described in it.

With the rapid improvement in computer performance in recent years, it has become possible to accumulate enormous amounts of text (here, a set of sentences consisting only of character strings) and to build databases from it. However, organizing and managing the stored text by hand has generally become difficult. A technique called text segmentation has therefore been developed, which analyzes an accumulated text database and divides text according to its semantic content (each resulting unit is called a semantic paragraph), and it is increasingly applied to the automatic classification and organization of text databases by computer. For example, there is a technique that performs text segmentation using information called a concept base. In this technique, multiple concept vectors, each a numerical vector representing a word and the patterns that co-occur with it, are created from learning data accumulated in advance. Text segmentation is then performed using the concept base, which is the collection of these concept vectors. The learning data consists of a large body of text on several fields (for example, only the fields of "politics", "economy", and "science") (see, for example, Patent Document 1).
Conventional text segmentation mainly evaluates the semantic continuity between sentences on the basis of a degree of connectivity computed across multiple sentences (see, for example, Non-Patent Document 1). When few sentences are considered in computing the connectivity, the method follows local changes in semantic content easily but becomes more likely to estimate too many semantic paragraphs. When many sentences are considered, the method captures global changes in semantic content but has difficulty handling text whose semantic content changes gradually.

Patent Document 1: Japanese Patent No. 3775239

Non-Patent Document 1: Hearst, M. A.: Multi-Paragraph Segmentation of Expository Text, 32nd Annual Meeting of the Association for Computational Linguistics, pp. 9-16 (1994)

To raise the accuracy of conventional text segmentation techniques, large-scale learning data must be prepared. When the learning data is small, the concept base cannot be built properly and segmentation accuracy falls. Moreover, although such techniques can handle the fields covered by the learning data prepared in advance, they cannot segment text from other fields. For example, when the learning data contains only information about "politics" and "economy", segmenting text in the field of "sports" is difficult.
The present invention has been made in view of the above points, and its object is to provide a text segmentation apparatus, method, program, and computer-readable recording medium capable of text segmentation without requiring learning data.

FIG. 1 is a diagram showing the principle configuration of the present invention.

The present invention (Claim 1) is a text segmentation device that divides text according to its content, comprising:
text decomposition means 201 that divides an input text into sentences and stores them in divided sentence storage means 202;
search word extraction means 211 that performs morphological analysis on the sentences divided by the text decomposition means 201, removes at least particles from the morphologically analyzed words, further removes words registered in a general word list created in advance, and thereby extracts search words and stores them in search word storage means 212;
related word acquisition means 221 that performs a web search based on the search words, performs morphological analysis on the retrieved text, removes at least particles from the morphologically analyzed words, further removes words registered in the general word list created in advance, and thereby extracts related words and stores them in related word storage means 222;
division candidate generation means 231 that, using a keyword set that combines the search words stored in the search word storage means 212 with the related words stored in the related word storage means 222, obtains semantic paragraphs based on the connectivity between the sentences stored in the divided sentence storage means 202, creates division candidates, and stores them in division candidate storage means 242; and
division result evaluation means 241 that evaluates the division candidates stored in the division candidate storage means 242 and selects and outputs one division result,
wherein the division result evaluation means 241 includes
means that, within the range of sentences included in each semantic paragraph of a division candidate stored in the division candidate storage means 242, refers to the keyword set to obtain the appearance frequency of each keyword, evaluates all the division candidates stored in the division candidate storage means on the basis of the appearance frequencies to obtain evaluation values, and selects the division candidate whose evaluation value is smallest.

Further, in the present invention (Claim 2), the division result evaluation means 241,
when obtaining the evaluation value, obtains a first index that takes a smaller value the more finely the input text is divided and a second index that takes a smaller value the more the content differs between semantic paragraphs, and takes the sum of the first index and the second index as the evaluation value.
Further, in the present invention (Claim 3), the division candidate generation means 231
has semantic paragraph generation means that compares the keyword sets across a plurality of preceding and following sentences and obtains semantic paragraphs, each composed of one or more sentences that cohere in content,
and the semantic paragraph generation means includes:
means that creates blocks B1 and B2, each grouping keyword sets, and obtains the connectivity C_i^b of the i-th and (i+1)-th sentences, using the appearance frequency of each word t, by

[equation not reproduced]

(where w_t^B1 is the frequency of word t in block B1 and w_t^B2 is the frequency of word t in block B2; C_i^b takes a value from 0 to 1, and the closer it is to 1, the more the words contained in block B1 and block B2 coincide);
means that, varying i = {1, 2, ..., N}, calculates

[equation not reproduced]

sets M block size parameters b = (b_1, b_2, ..., b_M), calculates the connectivity C_i^b for each block width, and obtains the average of those values as the average connectivity C_i of the i-th and (i+1)-th sentences by

[equation not reproduced]

and
means that, using the average connectivity C_i (where i = (1, 2, ..., N)), extracts the valleys of the average connectivity, which are the boundaries of semantic paragraphs, based on the condition

[equation not reproduced]

and acquires the semantic paragraphs based on those valleys.
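
The connectivity, averaging, and valley expressions of Claim 3 are not reproduced in this text, so the following Python sketch is illustrative only. It assumes the connectivity is the cosine similarity of the word-frequency vectors of blocks B1 and B2, as in the method of Non-Patent Document 1; this satisfies the stated properties (a value from 0 to 1 that approaches 1 as the two blocks share more words) but is not confirmed by the source. The block construction, the strict local-minimum valley test, and all function names are likewise assumptions.

```python
import math
from collections import Counter


def connectivity(block1, block2):
    """Assumed form of C_i^b: cosine similarity of word-frequency vectors."""
    w1, w2 = Counter(block1), Counter(block2)
    dot = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    norm = math.sqrt(sum(v * v for v in w1.values()) *
                     sum(v * v for v in w2.values()))
    return dot / norm if norm else 0.0


def average_connectivity(keyword_sets, block_sizes):
    """Average connectivity C_i for each gap between sentence i and i+1.

    keyword_sets: one list of keywords per sentence.
    block_sizes:  the M block widths b_1..b_M.
    """
    result = []
    for i in range(len(keyword_sets) - 1):
        values = []
        for b in block_sizes:
            # B1: up to b sentences ending at sentence i (assumed construction)
            b1 = [w for s in keyword_sets[max(0, i - b + 1): i + 1] for w in s]
            # B2: up to b sentences starting at sentence i+1
            b2 = [w for s in keyword_sets[i + 1: i + 1 + b] for w in s]
            values.append(connectivity(b1, b2))
        result.append(sum(values) / len(values))
    return result


def valleys(avg):
    """Gap indices that are strict local minima (hypothetical valley condition)."""
    return [i for i in range(1, len(avg) - 1)
            if avg[i - 1] > avg[i] < avg[i + 1]]
```

With the keyword sets `[["soccer", "ball"], ["soccer", "goal"], ["stock", "market"], ["stock", "price"]]` and a single block width of 1, the middle gap has zero connectivity and is the only valley, which would place a semantic paragraph boundary between the second and third sentences.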

FIG. 2 is a diagram for explaining the principle of the present invention.

The present invention (Claim 4) is a text segmentation method for dividing text according to its content, in which:
text decomposition means performs a text decomposition step (step 1) of dividing an input text into sentences and storing them in divided sentence storage means;
search word extraction means performs a search word extraction step (step 2) of performing morphological analysis on the sentences divided in the text decomposition step (step 1), removing at least particles from the morphologically analyzed words, further removing words registered in a general word list created in advance, and thereby extracting search words and storing them in search word storage means;
related word acquisition means performs a related word acquisition step (step 3) of performing a web search based on the search words, performing morphological analysis on the retrieved text, removing at least particles from the morphologically analyzed words, further removing words registered in the general word list created in advance, and thereby extracting related words and storing them in related word storage means;
division candidate generation means performs a division candidate generation step (step 4) of, using a keyword set that combines the search words stored in the search word storage means with the related words stored in the related word storage means, obtaining semantic paragraphs based on the connectivity between the sentences stored in the divided sentence storage means, creating division candidates, and storing them in division candidate storage means; and
division result evaluation means performs a division result evaluation step (step 5) of evaluating the division candidates stored in the division candidate storage means and selecting and outputting one division result,
wherein, in the division result evaluation step (step 5),
within the range of sentences included in each semantic paragraph of a division candidate stored in the division candidate storage means, the keyword set is referred to in order to obtain the appearance frequency of each keyword, all the division candidates stored in the division candidate storage means are evaluated on the basis of the appearance frequencies to obtain evaluation values, and the division candidate whose evaluation value is smallest is selected.

Further, in the present invention (Claim 5), in the division result evaluation step (step 5),
when the evaluation value is obtained, a first index that takes a smaller value the more finely the input text is divided and a second index that takes a smaller value the more the content differs between semantic paragraphs are obtained, and the sum of the first index and the second index is taken as the evaluation value.
Further, in the present invention (Claim 6), in the division candidate generation step (step 4),
a semantic paragraph generation step is performed that compares the keyword sets across a plurality of preceding and following sentences and obtains semantic paragraphs, each composed of one or more sentences that cohere in content,
and the semantic paragraph generation step performs:
a step of creating blocks B1 and B2, each grouping keyword sets, and obtaining the connectivity C_i^b of the i-th and (i+1)-th sentences, using the appearance frequency of each word t, by

[equation not reproduced]

(where w_t^B1 is the frequency of word t in block B1 and w_t^B2 is the frequency of word t in block B2; C_i^b takes a value from 0 to 1, and the closer it is to 1, the more the words contained in block B1 and block B2 coincide);
a step of varying i = {1, 2, ..., N} and obtaining the result by

[equation not reproduced]

and
a step of, using the average connectivity C_i (where i = (1, 2, ..., N)), extracting the valleys of the average connectivity, which are the boundaries of semantic paragraphs, based on the condition

[equation not reproduced]

and acquiring the semantic paragraphs based on those valleys.

The present invention (Claim 7) is a text segmentation program for causing a computer to function as each of the means constituting the text segmentation device according to any one of Claims 1 to 3.

The present invention (Claim 8) is a computer-readable recording medium storing the text segmentation program according to Claim 7.

To perform text segmentation without learning data, the present invention focuses on the fact that searching the web with search words yields multiple words relating to the content of a sentence. An enormous amount of information is now accumulated on the web, and the latest topics are continually provided. The web can thus be regarded as a collection of articles carrying all kinds of information. Indeed, when we want to look something up, we enter search words at a search site, search the web, and find out what a word means or what a matter involves. From this viewpoint, if the information on the web is used appropriately, concepts such as "sports" and "ball" can be acquired as corresponding to "soccer" and "baseball" without using learning data. That is, semantic paragraphs can be separated by acquiring words that reflect the content of the text from the varied information on the web and tracking the relatedness between sentences through changes in those words. As a result, the content of a text can be grasped without using learning data.

FIG. 1 is a diagram showing the principle configuration of the present invention.
FIG. 2 is a diagram for explaining the principle of the present invention.
FIG. 3 is a configuration diagram of a text segmentation device using web search in one embodiment of the present invention.
FIG. 4 is a flowchart of the outline operation in one embodiment of the present invention.
FIG. 5 is an example of a text in one embodiment of the present invention.
FIG. 6 is an example of sentences stored in the decomposed sentence storage unit in one embodiment of the present invention.
FIG. 7 is an example of general words registered in the general word list in one embodiment of the present invention.
FIG. 8 is an example of search words stored in the search word storage unit in one embodiment of the present invention.
FIG. 9 is a flowchart of the processing procedure of the related word extraction unit in one embodiment of the present invention.
FIG. 10 is an example of related words stored in the related word storage unit in one embodiment of the present invention.
FIG. 11 is an example of a keyword set stored in the keyword set storage unit in one embodiment of the present invention.
FIG. 12 is a flowchart of the processing procedure of the division candidate generation unit in one embodiment of the present invention.
FIG. 13 is an example of calculating the average connectivity in one embodiment of the present invention.
FIG. 14 is an example of division candidates stored in the division candidate storage unit in one embodiment of the present invention.
FIG. 15 is a flowchart of the processing procedure of the division result evaluation unit in one embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 3 shows the configuration of the segmentation device in one embodiment of the present invention. The segmentation device is implemented on a computer 260.

The segmentation device comprises: a control unit 250 that controls the device; an input unit 251 that inputs a text 264; a text decomposition unit 201 that divides the text into sentences; a decomposed sentence storage unit 202; a search word extraction unit 211 that extracts search words; a search word storage unit 212; a related word acquisition unit 221 that acquires related words; a related word storage unit 222; a division candidate generation unit 231 that extracts semantic paragraphs using a keyword set combining the search words and the related words and generates division candidates; a keyword set storage unit 232; a division result evaluation unit 241 that evaluates the division candidates and selects one division result; a division candidate storage unit 242; and an output unit 252 that outputs the extracted semantic paragraphs as the text division result.

A network 261 is connected to the segmentation device (computer 260) configured as above, giving it access to the web 262. The web 262 holds a plurality of articles 263 written in structured languages such as HTML and XML. The text 264 is the text input to the input unit 251 of the computer 260. The display unit 265 is a device that displays the results output from the control unit 250 through the output unit 252.

In the above configuration, the decomposed sentence storage unit 202, the search word storage unit 212, the related word storage unit 222, the keyword set storage unit 232, the division candidate storage unit 242, and the general word list storage unit 501 are storage media such as hard disks. The decomposed sentence storage unit 202 stores the sentences produced by the text decomposition unit 201. The search word storage unit 212 stores the search words extracted by the search word extraction unit 211. The related word storage unit 222 stores the related words obtained by the related word acquisition unit 221. The keyword set storage unit 232 stores the keyword sets created by the division candidate generation unit 231. The division candidate storage unit 242 stores the division candidates generated by the division candidate generation unit 231. The general word list storage unit 501 stores the set of general words referred to by the search word extraction unit 211.

Next, an outline of the operation of the above configuration is described.

FIG. 4 is a flowchart of the outline operation in one embodiment of the present invention.

When the text 264 is input through the input unit 251 (step 110), the text decomposition unit 201 divides the input text into sentences and stores them in the decomposed sentence storage unit 202 (step 120). The search word extraction unit 211 extracts the words that serve as search words from each sentence in the decomposed sentence storage unit 202 and stores them in the search word storage unit 212 (step 130). Next, the related word acquisition unit 221 searches the web 262 using the search words stored in the search word storage unit 212 and stores the acquired search results in the related word storage unit 222 as related words (step 140). The division candidate generation unit 231 creates keyword sets from the search words stored in the search word storage unit 212 and the related words stored in the related word storage unit 222, stores them in the keyword set storage unit 232, generates division candidates using the keyword sets, and stores the candidates in the division candidate storage unit 242 (step 150). The division result evaluation unit 241 acquires the division candidates from the division candidate storage unit 242 and selects from among them the one for which the value of the evaluation function is smallest (step 160). The output unit 252 outputs the selected result as the text segmentation result (step 170).
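
The flow of steps 120 to 160 can be sketched as a small driver that wires the processing units together. This is only an outline under stated assumptions: each unit is passed in as a callable, and the function name `segment_text`, the parameter names, and the representation of a division candidate are all hypothetical rather than taken from the patent.

```python
def segment_text(text, split, extract_search, acquire_related,
                 generate_candidates, evaluate):
    """Run steps 120-160 with caller-supplied units (all names hypothetical)."""
    sentences = split(text)                                    # step 120
    search = [extract_search(s) for s in sentences]            # step 130
    related = [acquire_related(ws) for ws in search]           # step 140
    # Keyword set per sentence: search words combined with related words.
    keyword_sets = [s + r for s, r in zip(search, related)]    # step 150
    candidates = generate_candidates(sentences, keyword_sets)  # step 150
    # Step 160: select the candidate whose evaluation value is smallest.
    return min(candidates, key=lambda c: evaluate(c, keyword_sets))
```

Passing the units as callables keeps the driver independent of any particular morphological analyzer or search service, neither of which the description names.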

The operation of each step shown in FIG. 4 is described concretely below. Although the configuration of FIG. 3 includes the control unit 250, the following description assumes that each component that performs a process is started and controlled by the control unit 250.

Step 110) Text input processing:
First, the text 264 shown in FIG. 5 is input from the input unit 251.

Step 120) Text decomposition processing:
The text decomposition unit 201 reads the input text one character at a time, divides it into N sentences as shown in FIG. 6, and stores them in the decomposed sentence storage unit 202. Here, a sentence is a unit delimited by the full stop "。". When the text decomposition unit 201 is run on the text 264 shown in FIG. 5, nine sentences 401 to 409 are generated and stored in the decomposed sentence storage unit 202. The number of sentences generated by the text decomposition unit 201 depends on the input text. If there is an input error in a full stop "。", multiple sentences are treated as one sentence.
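
As a minimal sketch of the text decomposition in step 120, the following Python function reads the text one character at a time and closes a sentence at each full stop "。". The function name `split_sentences` is hypothetical, and keeping trailing text that lacks a full stop as a final sentence is an assumption not stated in the description.

```python
def split_sentences(text):
    """Split text into sentences delimited by the Japanese full stop '。'."""
    sentences = []
    current = []
    for ch in text:  # read the input one character at a time
        current.append(ch)
        if ch == "。":
            sentences.append("".join(current).strip())
            current = []
    tail = "".join(current).strip()
    if tail:
        # Assumption: text after the last full stop is kept as a sentence.
        sentences.append(tail)
    return sentences
```

Because the only delimiter is "。", two sentences whose separating full stop is missing come back as a single element, which mirrors the behavior described for input errors.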

Step 130) Search word extraction processing:
The search word extraction unit 211 extracts search words. A search word is one of the one or more words entered when performing an AND search on the web (a search for results that contain all of the words). First, the search word extraction unit 211 reads the sentences stored in the decomposed sentence storage unit 202 and performs morphological analysis on each sentence. It then takes out all words except particles. Words that have inflected forms are converted to their base form and extracted; other words are extracted as search words without conversion.

The extracted words include commonly used words (called general words) such as "年" (year), "ある" (to be), and "ここ" (here). Because general words are not useful as search words, a general word list such as that shown in FIG. 7 is created in advance and registered in the general word list storage unit 501; words not registered in the general word list are treated as search words and stored in the search word storage unit 212. The search words stored in the search word storage unit 212 therefore depend on the general word list.

It is also desirable to use an appropriate number of words when performing a web search. If the number of extracted words is less than a threshold S_T, the search word extraction unit 211 extracts no search words and stores nothing in the search word storage unit 212. Conversely, if the number S of extracted words is equal to or greater than a threshold T, T search words are selected at random from the S search words and stored in the search word storage unit 212. With T = 30 and S_T = 1, running the search word extraction unit 211 on sentences 401 to 409 of FIG. 6 stores the search words 601 to 609 of FIG. 8 in the search word storage unit 212.
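
The filtering and thresholding of step 130 can be sketched as follows. The sketch assumes morphological analysis has already been done by an external analyzer and that its output is a list of (base form, part of speech) pairs with the label "particle" for particles; this token format, the function name `extract_search_words`, and the optional `rng` parameter are hypothetical.

```python
import random


def extract_search_words(tokens, general_words, T=30, S_T=1, rng=None):
    """Select search words from analyzed tokens.

    tokens: list of (base_form, part_of_speech) pairs (assumed format).
    general_words: set of words registered in the general word list.
    """
    rng = rng or random.Random()
    # Drop particles and words found in the general word list.
    words = [base for base, pos in tokens
             if pos != "particle" and base not in general_words]
    if len(words) < S_T:
        return []                    # too few words: extract nothing
    if len(words) >= T:
        return rng.sample(words, T)  # pick T words at random
    return words
```

Capping the list at T keeps the AND query bounded, and since each sentence ends up with T keywords in total, a sentence that already has T search words leaves no room for related words.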

Step 140) Related word extraction processing:
FIG. 9 is a flowchart of the processing procedure of the related word extraction unit in one embodiment of the present invention.

After the search words 601 to 609 corresponding to sentences 401 to 409 have been created, the related word acquisition unit 221 first reads the search words extracted by the search word extraction unit 211 from the search word storage unit 212. Next, it uses those search words to perform an AND search on the web 262 via the network 261 (step 141). An AND search is unaffected by the order in which the search words are entered and finds the articles 263 on the web 262 that contain all of the search words. In general, a web search returns results ordered from the articles most relevant to the entered search words. The unit therefore acquires the P articles 263 that rank highest among the search results. If no corresponding search word exists in the search word storage unit 212, the related word acquisition unit 221 performs no web search and stores nothing in the related word storage unit 222. Likewise, when the number S of search words equals the threshold T (S = T), no web search is performed and nothing is stored in the related word storage unit 222.

Next, the related word acquisition unit 221 extracts text from the P articles 263, which were collected in chronological order (step 143). The articles 263 are written in a structured language such as HTML or XML. Text is therefore obtained from each article 263 by parsing its tags, which consist of character strings enclosed in "<" and ">". The related word acquisition unit 221 then performs morphological analysis on the extracted text and extracts all words except particles (step 144). As in the search word extraction unit 211, every word with inflected forms is converted to its terminal form before extraction, and all other words are extracted as they are.
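The tag-based text extraction of step 143 can be sketched with the Python standard library. This is only an illustration under stated assumptions: the names `_TextCollector` and `extract_text` are hypothetical, and skipping the contents of `script` and `style` elements is an added assumption that the text above does not mention. Morphological analysis (step 144) is not shown.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects character data found outside of tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(article_html):
    """Return the visible text of one article as a single string."""
    parser = _TextCollector()
    parser.feed(article_html)
    parser.close()
    return " ".join(parser.chunks)
```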

The number of related words obtained varies with the search words used and with the articles 263 collected by the web search. In addition, if the extracted words were used directly as related words, general words could be treated as related words. The related word acquisition unit 221 therefore refers to the general word list storage unit 501, as the search word extraction unit 211 does, and uses only words that are not general words. Specifically, when there are S search words, it calculates the frequency of occurrence of each word extracted from the P texts, excluding the general words registered in the general word list storage unit 501. It then takes the T − S words with the highest frequency of occurrence as related words and stores them in the related word storage unit 222 (step 145). In this way, the total number of search words and related words extracted for each sentence is kept constant at the predetermined value T.
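The frequency-based selection of step 145 can be sketched as follows. The function name `top_related_words` and its argument layout are hypothetical, and ties between equally frequent words are broken by first occurrence, which is a property of `Counter.most_common` rather than something the text specifies.

```python
from collections import Counter


def top_related_words(article_words, general_words, S, T=30):
    """Pick the T - S most frequent non-general words as related words.

    article_words: one list of words per collected article (P lists).
    general_words: set of words registered in the general word list.
    S:             number of search words already held for the sentence.
    """
    counts = Counter(
        word
        for words in article_words
        for word in words
        if word not in general_words
    )
    # max(..., 0) guards against S > T, in which case nothing is selected.
    return [word for word, _ in counts.most_common(max(T - S, 0))]
```

Together with the S search words, the result brings the total for the sentence up to T whenever enough distinct words are available.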

Furthermore, to obtain appropriate related words, the number of articles 263 obtained by the web search should be as large as possible. Therefore, when the number P of articles 263 obtained by the web search is less than P_T (step 142, No), the search words are modified and articles 263 are collected again by an AND search on the web (step 146). Specifically, for each i (i = 1, 2, …, S), a web search is performed using the S − 1 words that remain after removing the i-th word from the S search words, and the number of hits is checked. For example, for the search words "golf", "shot", and "drive" (S = 3), three search word patterns are created, ("golf", "shot"), ("shot", "drive"), and ("golf", "drive"), and the number of hits for each is checked. The S − 1 words that yield the largest number of hits are selected as the search words and overwrite the contents of the search word storage unit 212 (step 147). The number of search words is then updated as S = S − 1 (step 148), and the web search is performed again to collect P articles. For example, if ("golf", "shot") yields 1000 hits, ("shot", "drive") yields 500 hits, and ("golf", "drive") yields 200 hits, then "golf" and "shot" are written to the search word storage unit 212 and S is updated to 2. A web search is then performed with the search words "golf" and "shot", and P articles 263 are acquired. These steps are repeated until the number of articles 263 satisfies P ≥ P_T. Once the condition P ≥ P_T is satisfied, related words are extracted from the P articles 263 obtained. If, on the other hand, the number of collected articles 263 never reaches P_T even after repeated modification of the search words, the search words stored in the search word storage unit 212 are deleted, and nothing is stored in the related word storage unit 222 as related words. As an example, related words 801 to 809 in FIG. 10 show the related words obtained by executing the related word acquisition unit 221 on search words 601 to 609 in FIG. 8 with T = 30 and P_T = 20.
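The query relaxation loop of steps 142 and 146 to 148 can be sketched as follows. The callables `count_hits` and `fetch_articles` are hypothetical stand-ins for a search backend and are not part of the original text. Giving up once a single remaining word still returns fewer than P_T articles is an assumption about when "repeated modification" ends.

```python
def relax_search_words(words, count_hits, fetch_articles, P_T=20):
    """Drop one search word at a time until at least P_T articles are found.

    count_hits(words) -> int        hypothetical: number of hits of an AND search
    fetch_articles(words) -> list   hypothetical: top-ranked articles of an AND search
    Returns (search_words, articles), or ([], []) when no query reaches P_T.
    """
    words = list(words)
    while words:
        articles = fetch_articles(words)           # steps 141 / 146
        if len(articles) >= P_T:                   # step 142
            return words, articles
        if len(words) == 1:
            break                                  # nothing left to remove
        # Build the S candidate queries, each missing the i-th word.
        candidates = [words[:i] + words[i + 1:] for i in range(len(words))]
        # Keep the candidate with the most hits (step 147); S becomes S - 1 (step 148).
        words = max(candidates, key=count_hits)
    return [], []
```

With the hit counts from the example in the text (1000, 500, and 200), the loop keeps "golf" and "shot" after the first relaxation.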

Step 150) Division candidate generation processing:
When the related word acquisition unit 221 has finished processing all sentences stored in the decomposed sentence storage unit 202, the division candidate generation unit 231 reads the search words and related words stored in the search word storage unit 212 and the related word storage unit 222, respectively, and concatenates them to generate keyword sets. FIG. 11 shows an example of keyword sets created from the search words in FIG. 8 and the related words in FIG. 10. For example, keyword set 1001 is created by concatenating search words 601 and related words 801. The created keyword sets are stored in the keyword set storage unit 232. For a sentence with no search words, there are no corresponding related words either, so the division candidate generation unit 231 creates no keyword set for it.

Because a keyword set consists of words that reflect the content of the text, changes in the content of the text 264 can be captured by examining changes in the words contained in the keyword sets. The division candidate generation unit 231 therefore compares the generated keyword sets across several preceding and following sentences and obtains semantic paragraphs, each consisting of one or more sentences that cohere in content. Because text is generally written in order from the beginning, the comparison proceeds from the beginning of the text, creating blocks that each group several keyword sets and comparing those blocks.

FIG. 12 is a flowchart of the processing procedure of the division candidate generation unit in one embodiment of the present invention.

Specifically, letting b be the block size, a block B1 containing the (i + 1 − b)-th through i-th keyword sets and a block B2 containing the (i + 1)-th through (i + b)-th keyword sets are determined, and the words in the keyword sets contained in the two blocks B1 and B2 are compared (where i = 1, 2, …, N). Keyword sets containing no words are not included when creating blocks B1 and B2. Instead, block B1 takes a non-empty keyword set from a sentence earlier than the one concerned, and block B2 takes a non-empty keyword set from a later sentence. For example, if the j-th keyword set is empty, a non-empty keyword set is sought in the order j − 1, j − 2, …, 1 when creating block B1 and is included in block B1. When creating block B2, a non-empty keyword set is sought in the order j + 1, j + 2, … and is included in block B2. If no keyword set can be included in a block, an empty block is created. After blocks B1 and B2 are created, the frequency w_t of each word t contained in each block is calculated. The connectivity between the i-th and (i + 1)-th sentences is then evaluated from the frequencies w_t by the following formula.

Here, w_t^B1 represents the frequency of word t in block B1, and w_t^B2 represents the frequency of word t in block B2.

C_i^b takes a value from 0 to 1; the closer it is to 1, the more the words contained in block B1 and block B2 coincide. When block B1 or block B2 contains no words at all, the connectivity C_i^b is calculated as 0. The division candidate generation unit 231 varies i over (1, 2, …, N) and calculates the connectivity C_i^b for each i. Further, M values of the block size parameter are set as b = (b_1, b_2, …, b_M), and the connectivity C_i^b is calculated for each block width (step 151).
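The block construction and connectivity calculation of step 151 can be sketched as follows. The connectivity formula itself is not reproduced in this text, so the sketch assumes a cosine similarity between the word frequency vectors of the two blocks. That choice matches the stated properties (a value from 0 to 1, higher when the blocks share words, 0 when a block is empty) but is an assumption, not the patent's formula. The function names, and the reading that each block collects the b nearest non-empty keyword sets, are likewise hypothetical.

```python
import math
from collections import Counter


def build_blocks(keyword_sets, i, b):
    """Return (B1, B2) for the boundary after sentence i (1-based).

    keyword_sets[n - 1] is the keyword list of sentence n; empty lists
    are skipped and replaced by the next non-empty set further away.
    """
    before = [ks for ks in reversed(keyword_sets[:i]) if ks][:b]
    after = [ks for ks in keyword_sets[i:] if ks][:b]
    return before, after


def connectivity(keyword_sets, i, b):
    """Assumed cosine similarity between the word frequencies of B1 and B2."""
    block1, block2 = build_blocks(keyword_sets, i, b)
    w1 = Counter(word for ks in block1 for word in ks)
    w2 = Counter(word for ks in block2 for word in ks)
    if not w1 or not w2:
        return 0.0  # an empty block gives connectivity 0
    dot = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    norm1 = math.sqrt(sum(v * v for v in w1.values()))
    norm2 = math.sqrt(sum(v * v for v in w2.values()))
    return dot / (norm1 * norm2)
```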

Next, the average of these values is calculated by the following formula as the average connectivity C_i between the i-th and (i + 1)-th sentences (step 152).
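The averaging over block widths can be sketched as follows, assuming a plain arithmetic mean over the M block widths, as the word "average" in the text suggests. The function name and the nested-list layout of the input are hypothetical.

```python
def average_connectivity(per_width):
    """Average the connectivity over the M block widths.

    per_width[m][i - 1] holds C_i^b for the m-th block width b_m.
    Returns a list whose (i - 1)-th element is the average connectivity C_i.
    """
    M = len(per_width)
    # zip(*per_width) groups, for each i, the M values C_i^{b_1}, ..., C_i^{b_M}.
    return [sum(values) / M for values in zip(*per_width)]
```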

Next, the division candidate generation unit 231 uses the average connectivity C_j (where j = 1, 2, …, N) to extract the valleys of the average connectivity, that is, the boundaries between semantic paragraphs, and detects the locations at which to divide (step 152). A valley of the average connectivity is a local minimum of the average connectivity that satisfies the following conditions.

The division candidate generation unit 231 then sorts the values of C_j at which valleys were detected in ascending order, obtains the corresponding sentence numbers d_1, d_2, …, d_K, and takes these as the division points of the text 264 (step 153). Using the sentence numbers in ascending order of C_j allows divisions to be made in order of how distinct the semantic paragraph boundaries are. For example, processing text 301 in FIG. 5 with the division candidate generation unit 231 yields the average connectivity 1101 shown in FIG. 13 (in that figure, the valleys of the average connectivity are underlined). In the average connectivity 1101, C_3 = 0.174212 at i = 3 and C_6 = 0.269452 at i = 6, so two valleys are detected. Sorting C_3 and C_6 in ascending order and looking up the corresponding sentence numbers gives d_1 = 3 and d_2 = 6, with K = 2.
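Valley detection and ordering (steps 152 and 153) can be sketched as follows. The exact valley conditions are not reproduced in this text, so the sketch assumes that a valley is a value strictly smaller than both of its neighbours and that the first and last positions are never valleys. The function name is hypothetical.

```python
def find_division_points(avg):
    """Return sentence numbers d_1, ..., d_K ordered by ascending C_j.

    avg[j - 1] is the average connectivity C_j. A valley is assumed to be
    a strict local minimum; the two end positions are not considered.
    """
    valleys = [
        (avg[k], k + 1)                     # (C_j, sentence number j)
        for k in range(1, len(avg) - 1)
        if avg[k] < avg[k - 1] and avg[k] < avg[k + 1]
    ]
    # Smaller C_j means a more distinct boundary, so it comes first.
    return [j for _, j in sorted(valleys)]
```

For a series whose only valleys are C_3 = 0.174212 and C_6 = 0.269452, as in the example in the text, the function returns the sentence numbers 3 and 6 in that order.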

Finally, for the K extracted division points, the division candidate generation unit 231 uses the first j division points d_1, d_2, …, d_j (j = 1, 2, …, K) to divide the text 264 between the d_m-th and (d_m + 1)-th sentences (m = 1, 2, …, j), and stores the result in the division candidate storage unit 242 as the j-th division candidate, as shown in FIG. 14 (step 154). In other words, the text is divided by adding division points one at a time in order of how distinct the semantic paragraph boundaries are. In the example of text 301 in FIG. 5, d_1 = 3 and d_2 = 6, so two division candidates (K = 2) are obtained. When j = 1, only d_1 = 3 is used as a division point. The text is therefore divided between the third and fourth sentences, and a division candidate consisting of two semantic paragraphs, semantic paragraph 1201 and semantic paragraph 1202 in FIG. 14, is stored in the division candidate storage unit 242. When j = 2, d_1 = 3 and d_2 = 6 are used as division points, so the text is divided between the third and fourth sentences and between the sixth and seventh sentences. That is, a division candidate consisting of three semantic paragraphs, semantic paragraphs 1203, 1204, and 1205, is stored in the division candidate storage unit 242.
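The construction of the K division candidates in step 154 can be sketched as follows. The function name and the representation of a semantic paragraph as a list of sentence numbers are hypothetical choices made for illustration.

```python
def make_division_candidates(num_sentences, division_points):
    """Build the K division candidates from d_1, ..., d_K.

    division_points is ordered from the most distinct boundary to the
    least distinct one. The j-th candidate cuts the text after each of
    the sentences d_1, ..., d_j.
    """
    candidates = []
    for j in range(1, len(division_points) + 1):
        cuts = sorted(division_points[:j])
        bounds = [0] + cuts + [num_sentences]
        # Each semantic paragraph is the list of its sentence numbers.
        paragraphs = [
            list(range(lo + 1, hi + 1)) for lo, hi in zip(bounds, bounds[1:])
        ]
        candidates.append(paragraphs)
    return candidates
```

For nine sentences with d_1 = 3 and d_2 = 6, the first candidate has two semantic paragraphs (sentences 1 to 3 and 4 to 9) and the second has three (sentences 1 to 3, 4 to 6, and 7 to 9), matching the example in the text.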

Step 160) Division result evaluation processing:
When the division candidate generation unit 231 has created the K division candidates, the division result evaluation unit 241 refers to the division candidates stored in the division candidate storage unit 242 and the keyword sets stored in the keyword set storage unit 232. It then selects one of the K division results stored in the division candidate storage unit 242.

FIG. 15 is a flowchart of the processing procedure of the division result evaluation unit in one embodiment of the present invention.

The division result evaluation unit 241 first reads the keyword sets stored in the keyword set storage unit 232 and calculates the frequency of occurrence w_t^all of each word t (step 161). For example, the frequencies of occurrence of "golf" and "delicious" are obtained from keyword sets 1001 through 1009 in FIG. 11 as follows.

Next, the i-th division candidate (i = 1, 2, …, K) is read from the division candidate storage unit 242, and for the j-th semantic paragraph (j = 1, 2, …, i + 1), the keyword sets stored in the keyword set storage unit 232 are referenced within the range of sentences contained in that semantic paragraph to calculate the frequency of occurrence w_t^j of each word t. For example, in the division candidate storage unit 242 of FIG. 14, when i = 1, semantic paragraphs 1201 and 1202 are referenced, so j = 1, 2. When j = 1, semantic paragraph 1201 contains sentences 1, 2, and 3, so keyword sets 1001 through 1003 among the keyword sets stored in the keyword set storage unit 232 are referenced to obtain the frequency of occurrence of each word t. The frequencies of occurrence of "golf" and "delicious" in this case are obtained as follows.

When j = 2, semantic paragraph 1202 contains sentences 4, 5, 6, 7, 8, and 9, so keyword sets 1004 through 1009 are referenced to obtain the frequency of occurrence of each word t. The frequencies of occurrence of "golf" and "delicious" in this case are obtained as follows.
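The per-paragraph frequency calculation can be sketched as follows. The frequency formulas and the example values are not reproduced in this text, so the sketch assumes that a frequency is a raw occurrence count; the original may normalize it differently. The function names are hypothetical.

```python
from collections import Counter


def word_frequencies(keyword_sets, sentence_numbers):
    """Count each word over the keyword sets of the given sentences (1-based)."""
    return Counter(
        word for n in sentence_numbers for word in keyword_sets[n - 1]
    )


def candidate_frequencies(keyword_sets, paragraphs):
    """Return (w_all, [w_1, w_2, ...]) for one division candidate.

    paragraphs is a list of semantic paragraphs, each given as a list of
    sentence numbers.
    """
    w_all = word_frequencies(keyword_sets, range(1, len(keyword_sets) + 1))
    w_par = [word_frequencies(keyword_sets, p) for p in paragraphs]
    return w_all, w_par
```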

After calculating the frequency of occurrence of each word t for each semantic paragraph in the i-th division candidate (i = 1, 2, …, K), the division result evaluation unit 241 uses the frequencies w_t^all and w_t^j (j = 1, 2, …, i + 1) to calculate the value of the following evaluation function Q_i.

Here, Q_i^1 is an index of the fineness of the division that takes a value from 0 to 1; the more finely the text 264 is divided, the smaller its value. Q_i^2 is an index of the degree of difference in content between semantic paragraphs that takes a value from 0 to 1; the more the content differs between the j-th and (j + 1)-th semantic paragraphs, the smaller its value. Finding the division that minimizes Q_i, the sum of these two indices, yields a result that is finely divided by content. A calculation example for i = 1 is described using the division candidates in FIG. 14. When i = 1, the division candidate stored in the division candidate storage unit 242 has two semantic paragraphs, so Q_1^1 is calculated by comparing the entire keyword set with the keyword set of the first semantic paragraph and by comparing the entire keyword set with the keyword set of the second semantic paragraph, and the evaluation value is Q_1^1 = 0.801624.

Next, Q_1^2 is calculated by comparing the keyword set of the first semantic paragraph with the keyword set of the second semantic paragraph, and the evaluation value is Q_1^2 = 0.316735. The sum of the two indices is then Q_1 = 0.801624 + 0.316735 = 1.118359 (step 162). When the calculation is finished, i is updated as i = i + 1 (step 163), the next division candidate (i = 2) stored in the division candidate storage unit 242 is referenced, and the same calculation is performed. This calculation is repeated for all division candidates stored in the division candidate storage unit 242 (step 164). In the example of FIG. 14, the process ends with the calculation of the evaluation function value Q_2 for division candidate number i = 2.

Next, when the calculation of Q_i (i = 1, 2, …, K) has been completed for the K division candidates stored in the division candidate storage unit 242 (step 164, Yes), the division result evaluation unit 241 finally selects the i-th division candidate that minimizes Q_i (step 165). In the example of FIG. 14, Q_1 = 1.118359 and Q_2 = 0.990127 for division candidate numbers i = 1, 2, so i = 2 is selected from the two division candidates stored in the division candidate storage unit 242.
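The final selection of steps 162 to 165 can be sketched as follows. The formulas for the two indices are not reproduced in this text, so the sketch takes their values as given and shows only the sum and the minimum search. The function names are hypothetical.

```python
def evaluation_value(q1, q2):
    """Q_i as the sum of the fineness index Q_i^1 and the difference index Q_i^2."""
    return q1 + q2


def select_best_candidate(q_values):
    """Return the 1-based number i of the candidate with the smallest Q_i.

    q_values[i - 1] is Q_i. If several candidates share the minimum,
    the lowest-numbered one is returned.
    """
    if not q_values:
        raise ValueError("no division candidates to evaluate")
    best_index = min(range(len(q_values)), key=q_values.__getitem__)
    return best_index + 1
```

With the values from the example in the text, Q_1 = 1.118359 and Q_2 = 0.990127, the second candidate is selected.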

Step 170) Selection result output processing:
When the division result evaluation unit 241 has selected the number of the division candidate that minimizes the evaluation function Q_i, it passes that number to the output unit 252. On receiving the number, the output unit 252 reads the division candidate corresponding to that number from the division candidates stored in the division candidate storage unit 242 and outputs it to the display unit 265 as the division result. In the example of FIG. 14, i = 2 is passed to the output unit 252, so the output unit 252 reads the second division candidate stored in the division candidate storage unit 242 and outputs semantic paragraphs 1203 through 1205 to the display unit 265 as the text segmentation result.

The operation of the configuration shown in FIG. 3 can be built as a program and installed and executed on a computer used as the text segmentation device, or distributed via a network.

The program so built can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and then installed on a computer or distributed.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to techniques for dividing the sentences of a text, such as an article or a story, into semantic units on a computer.

201 Text division means, text division unit
202 Decomposed sentence storage means, decomposed sentence storage unit
211 Search word extraction means, search word extraction unit
221 Related word acquisition means, related word acquisition unit
222 Related word storage means, related word storage unit
231 Division candidate generation means, division candidate generation unit
232 Keyword set storage unit
241 Division result evaluation means, division result evaluation unit
250 Control unit
251 Input unit
252 Output unit
260 Computer
261 Network
262 Web
263 Article written in a structured language
264 Text
265 Display unit
501 General word list storage unit

Claims

A text segmentation device that divides text according to content, comprising:
text decomposition means for dividing input text into sentence units and storing the sentences in decomposed sentence storage means;
search word extraction means for performing morphological analysis on the sentences divided by the text decomposition means, extracting search words by removing at least particles from the morphologically analyzed words and further removing words registered in a general word list prepared in advance, and storing the search words in search word storage means;
related word acquisition means for performing a web search based on the search words, performing morphological analysis on the retrieved text, extracting related words by removing at least particles from the morphologically analyzed words and further removing words registered in the general word list prepared in advance, and storing the related words in related word storage means;
division candidate generation means for obtaining semantic paragraphs based on the connectivity between the sentences stored in the decomposed sentence storage means, using keyword sets that combine the search words stored in the search word storage means with the related words stored in the related word storage means, creating division candidates, and storing the division candidates in division candidate storage means; and
division result evaluation means for evaluating the division candidates stored in the division candidate storage means and selecting and outputting one division result,
wherein the division result evaluation means includes
means for obtaining the frequency of occurrence of each keyword by referring to the keyword sets within the range of sentences contained in each semantic paragraph of a division candidate stored in the division candidate storage means, evaluating all division candidates stored in the division candidate storage means based on the frequencies of occurrence to obtain evaluation values, and selecting the division candidate with the smallest evaluation value.

The text segmentation device according to claim 1, wherein the division result evaluation means,
in obtaining the evaluation value, obtains a first index that takes a smaller value the more finely the input text is divided and a second index that takes a smaller value the more the content differs between the semantic paragraphs, and uses the sum of the first index and the second index as the evaluation value.

The text segmentation device according to claim 1, wherein the division candidate generation means includes
semantic paragraph generation means for comparing the keyword sets across a plurality of preceding and following sentences and obtaining the semantic paragraphs, each composed of one sentence or a plurality of sentences that cohere in content,
and the semantic paragraph generation means includes:
means for creating blocks B1 and B2 that group the keyword sets and obtaining the connectivity C_i^b between the i-th and (i + 1)-th sentences, using the frequency of occurrence of each word t, by the following formula

(where w_t^B1 represents the frequency of word t in block B1, w_t^B2 represents the frequency of word t in block B2, and C_i^b takes a value from 0 to 1, a value closer to 1 indicating that the words contained in block B1 and block B2 coincide);
means for calculating the connectivity for i = {1, 2, …, N} as follows

and, with the block size parameter b set as b = (b_1, b_2, …, b_M), calculating the connectivity C_i^b for each block width and obtaining the average of these values as the average connectivity C_i between the i-th and (i + 1)-th sentences by the following formula

; and
means for extracting valleys using the average connectivity C_i (where i = 1, 2, …, N) based on the following conditions

and obtaining the semantic paragraphs based on the valleys.

A text segmentation method that divides text according to content, comprising:
a text decomposition step in which text decomposition means divides input text into sentence units and stores the sentences in decomposed sentence storage means;
a search word extraction step in which search word extraction means performs morphological analysis on the sentences divided in the text decomposition step, extracts search words by removing at least particles from the morphologically analyzed words and further removing words registered in a general word list prepared in advance, and stores the search words in search word storage means;
a related word acquisition step in which related word acquisition means performs a web search based on the search words, performs morphological analysis on the retrieved text, extracts related words by removing at least particles from the morphologically analyzed words and further removing words registered in the general word list prepared in advance, and stores the related words in related word storage means;
a division candidate generation step in which division candidate generation means obtains semantic paragraphs based on the connectivity between the sentences stored in the decomposed sentence storage means, using keyword sets that combine the search words stored in the search word storage means with the related words stored in the related word storage means, creates division candidates, and stores the division candidates in division candidate storage means; and
a division result evaluation step in which division result evaluation means evaluates the division candidates stored in the division candidate storage means and selects and outputs one division result.

The text segmentation method according to claim 4, wherein, in the division result evaluation step,
in obtaining the evaluation value, a first index that takes a smaller value the more finely the input text is divided and a second index that takes a smaller value the more the content differs between the semantic paragraphs are obtained, and the sum of the first index and the second index is used as the evaluation value.

The text segmentation method according to claim 4, wherein the division candidate generation step performs
a semantic paragraph generation step of comparing the keyword sets across a plurality of preceding and following sentences and obtaining the semantic paragraphs, each composed of one sentence or a plurality of sentences that cohere in content,
and the semantic paragraph generation step includes:
a step of creating blocks B1 and B2 that group the keyword sets and obtaining the connectivity C_i^b between the i-th and (i + 1)-th sentences, using the frequency of occurrence of each word t, by the following formula

(where w_t^B1 represents the frequency of word t in block B1, w_t^B2 represents the frequency of word t in block B2, and C_i^b takes a value from 0 to 1, a value closer to 1 indicating that the words contained in block B1 and block B2 coincide);
a step of calculating the connectivity for i = {1, 2, …, N} as follows

and, with the block size parameter b set as b = (b_1, b_2, …, b_M), calculating the connectivity C_i^b for each block width and obtaining the average of these values as the average connectivity C_i between the i-th and (i + 1)-th sentences by the following formula

; and
a step of extracting valleys using the average connectivity C_i (where i = 1, 2, …, N) based on the following conditions

and obtaining the semantic paragraphs based on the valleys.

A text segmentation program for causing a computer to function as each of the means constituting the text segmentation device according to any one of claims 1 to 3.

A computer-readable recording medium storing the text segmentation program according to claim 7.