JP3578618B2 - Document splitting device

Info

Publication number: JP3578618B2
Application number: JP04472198A
Authority: JP
Inventors: 雅之亀田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1998-02-26
Filing date: 1998-02-26
Publication date: 2004-10-20
Anticipated expiration: 2018-02-26
Also published as: JPH11242684A

Description

【０００１】
【発明の属する技術分野】
本発明は、文書分割装置に関するものである。
【０００２】
【従来の技術】
例えば、複数の新聞記事が一緒になっている文書に対して、キーワード抽出等を行うと、新聞記事がいくつかの分野を含んでいると様々なキーワードが混ざり合って抽出されてしまう。また、文書検索において、指定された検索語を含む文書を検索しても、大きな文書であると、検索語に関連の深い部分を探すことが必要になるが、予め分割した単位を対象とすることができれば、直ちにその分割単位に辿り着くことができる。
このように、一つの文書を内容のまとまり毎に分割することができると、様々な文書処理が容易になる、即ち、文書を内容のまとまり毎に分割することにより、有用な文書処理を実現することができる。
【０００３】
文書分割の最も単純な方法としては、異なり語数の増加率に着目して、増加率の極小値を切れ目として認識する方法［１］、異なり語数の増加率の代わりに、意味レベルの単語の類似単語の結束度を文書上の位置の一定の窓幅での総和を用いる方法［２］がある。また、意味的に関連性のある語の連鎖（語彙的結束性）に着目する方法［３］がある。
これらの方法は、近傍の連鎖に着目しており、隣接間の関連に着目したボトムアップ的な視点での処理となっているが、文書分割のような処理は、トップダウン的な広い視野の処理が必要である。
【０００４】
【発明が解決しようとする課題】
本発明では、隣接間の関連だけでなく、広域的な関連も考慮に入れた文書分割装置を提供するものである。
【０００５】
【課題を解決するための手段】
請求項１の発明は、電子化された文書から言語要素を切り出して記憶手段に記憶する言語要素切り出し手段と、前記記憶手段に記憶された全ての言語要素同士の関連度を、言語要素内の共通の文字、単語、類義単語のいずれか毎の総和の割合により求めて言語要素間関連度行列記憶手段へ記憶する言語要素間関連度行列取得手段と、前記言語要素間関連度行列記憶手段へ記憶された言語要素間関連度行列を、各部分行列の内の要素の関連度の密度と各部分行列の外の要素の関連度の密度との比を評価値とし、該評価値を用いて最適な関連性の高い部分行列の並びに分割して部分行列記憶手段へ記憶する行列分割手段とを有し、前記行列分割手段により分割された部分行列に対して、再帰的に前記行列分割手段を用いることにより、階層的に文書を分割する文書分割装置である。
【０００６】
請求項２の発明は、請求項１に記載された文書分割装置において、前記言語要素切り出し手段は、段落、文、行のいずれかを言語要素として切り出す文書分割装置である。
【００１０】
請求項３の発明は、請求項１または２に記載された文書分割装置において、前記行列分割手段で用いる評価値は、分割された部分行列の内容のまとまりの程度を表す文書分割装置である。
【００１８】
【発明の実施の形態】
図１は、本発明の文書分割装置を概略的に示したブロック図であって、図中、Ｄは本発明の文書分割装置による分割の対象となる電子化文書、１は言語要素切出し手段であって電子化文書から言語要素群ＬＥを切り出す。切り出された言語要素群ＬＥは、後述のように言語要素間関連度評価手段２によって相互の関連度が評価され、言語要素間関連度行列取得手段３は前記言語要素間関連度評価手段２の評価結果に基づいて言語要素間関連度行列３ａを作成する。行列分割手段４は前記言語要素間関連度行列３ａから関連度に応じて文書を分割する。
【００１９】
以下、本発明をその実施例について詳しく説明する。
まず、電子化された文書は、言語要素切り出し手段により、言語要素群に切り出される。この際の言語要素とは、文書における行であったり、文であったり、段落などであり、特に、いずれかを特定するものではない。これらの言語要素を単位とする切り出しは、一定文字数によったり、句点や改行で容易に切り出すことができる。
【００２０】
【表１】

【００２１】
例えば、表１の文書を改行コードあるいは行先頭から４０文字目で強制的に切り出せば、表２のようになる（［］内は、切り出し番号である）。
【００２２】
【表２】

【００２３】
また、改行コードあるいは句点で切り出せば、表３のような文の切り出しを得る。
【００２４】
【表３】

【００２５】
さらに、改行だけで切り出せば、表４のようになる。
【表４】

【００２６】
次に、言語要素間関連度行列取得手段及び言語要素間関連度評価手段について説明する。
切り出された言語要素群から、言語要素間関連度行列取得手段により、全ての言語要素間の関連度から成る言語要素間関連度行列が得られる。この際、２つの言語要素間の関連度は、言語要素間関連度評価手段により求められる。言語要素間の関連度の評価法は、例えば、最も簡単には、一方の言語要素内の文字の総数のうち、もう一方の言語要素内の文字と共通の文字数の割合をみればよい。あるいは、日本語の場合であれば、文字のうち文字自体に意味を持たない「かな文字」を除いた文字とするのもよい。また、文を文形態素（単語）に分割する形態素解析系を用いて共通単語の割合を用いてもよい。さらに、シソーラス辞書等を用いて異なる単語でも語彙的な関連を考慮して関連度を求めてもよい。
例えば、表３の第２文と第４文
【００２７】
【表５】

【００２８】
この２文の文字総数は、第２文は７文字、第４文は４１文字であり、共通文字は、「輸」、「出」、「規」、「制」、「が」、「始」、の６文字であり、第２文と第４文の関連度は、第２文から見た場合は、６／７＝０．８５７、第４文から見た場合は、６／４１＝０．１４６となる。この場合は、２つの関連度が得られるが、両文を全体から見た場合は６×２／（７＋４１）＝１２／４８＝０．２５を一つの関連度として用いてもよい。
また、文字から平仮名と句読点を除くと、第２文は６文字、第４文は２６文字、共通文字は５文字であるから、５／６＝０．８３３と５／２６＝０．１９２、あるいは５×２／（６＋２６）＝０．３１２５が得られる。
【００２９】
また、形態素解析系を用いて単語分割し、一般名詞やサ変動詞を切り出すと、
２：輸出、規制、始動
４：通常、兵器、部品、加工、機械、転用、工業、製品、輸出、規制、日本
となり、第２文が３単語、第４文が１１単語、共通単語が「輸出」と「規制」の２単語なので、２／３＝０．６６７と２／１１＝０．１８２、あるいは２×２／（３＋１１）＝４／１４＝０．２９が得られる。
さらに異なる単語であっても、２単語間の意味的な関連度がシソーラス辞書等を用いて得ることができれば、その関連度を共通単語数に加えて計算することができる。
このようにして、全ての言語要素間の関連度を計算することによって、言語要素間関連度行列を得ることができる。例えば、言語要素を文として、関連度を共通単語の割合によるとすれば、表６のような文間関連度マトリックスを得ることができる。
【００３０】
【表６】

【００３１】
なお、表６の行列要素は、関連度を１０倍して四捨五入により、１桁の整数で表示してある。また、自分自身との関連度に当たる要素は‘＊’で示している。
【００３２】
次に、行列分割手段について説明する。
以上で得られた言語要素間関連度行列を行列分割手段が関連度の高い部分行列に分割する。
表７は段落間関連度行列から関連度の高い部分行列の並びを抽出・分割したイメージを示したものである。表中、‘Ｈ’は高い関連度、‘Ｌ’は低い関連度を示す。
【００３３】
【表７】

【００３４】
上記のような部分行列に分割された場合は、分割された部分行列の範囲に応じて部分文書が対応する。
この行列の分割の指標としては、（ａ）部分行列の内の関連度の平均値（Ｈの平均値）に対する、（ｂ）部分行列外の関連度の平均値（Ｌの平均値）あるいは（ｃ）行列全体の関連度の平均値の比、等を用いてその値が最小になるような分割を計算する。この計算はよく知られた動的計画法等を用いれば効率的に行うことができる。
部分行列の内外の比を用いるとすると、表７では、
（Ｌの総和／Ｌの要素数）／（Ｈの総和／Ｈの要素数）
が評価指標となる。
【００３５】
表８は、１０の新聞記事を連結した文書の段落間関連度行列である（要素＃は、１０の意味である）。表９の関連度行列に対して、前記の指標に基づき、最適な分割を計算すると、次のような部分行列が抽出・分割される。
第１段落〜第６段落
第７段落〜第１３段落
第１４段落〜第１７段落
第１８段落〜第２７段落
第２８段落〜第３３段落
第３４段落〜第３６段落
第３７段落〜第３８段落
第３９段落〜第５５段落
ここで、Ｌの総和／Ｌの要素数＝１６５．３７／２４８６＝０．０６７
Ｈの総和／Ｈの要素数＝１３４５．６２／５３９＝２．４９７
から、評価値は、０．０２６６である。
【００３６】
【表８】

【００３７】
【表９】

【００３８】
また、部分行列に対し、再帰的に分割を施すとさらに内部の分割を得ることができる。例えば、第１８段落から第２７段落の部分行列に対しては、評価値は０．２３７であって、第１８段落〜第２１段落及び第２２段落〜第２７段落の分割が得られる（表１０）。
【００３９】
【表１０】

【００４０】
第３９段落から第５５段落の部分行列に対しては、評価値は０．６８１で
第３９段落〜第４０段落、第４１段落、第４２段落、第４３段落〜第４８段落、第４９段落、第５０段落〜第５２段落、及び第５３段落から第５５段落、の分割が得られる（表１１）。
【００４１】
【表１１】

【００４２】
表９，表１０，表１１の分割の評価値が各々０．０２６６，０．２３７，０．６８１と大きくなるにつれて、内容の分割の程度が緩くなっていることが分かるように、分割の指標として用いることができる。即ち、この指標が小さい程、内容のまとまりが強く、大きい程まとまりは弱いことを示唆している。
【００４３】
また、本発明の文書分割装置によって分割された文書は、その分割に応じて文書を識別して表示する文書表示手段、前記分割された部分文書ごとに文書を処理する文書処理手段、文書からのキーセンテンス抽出処理を用いて抄録表示を行う文書抄録手段、前記分割された部分文書ごとに文書を管理する文書管理手段、及び、文書を分割された単位で検索対象として管理する文書検索手段において利用される。
【００４４】
【発明の効果】
本発明によれば、次のような効果を奏する。
（１）文書間の広域的な関連をも考慮にいれた文書分割手法であるため、従来の隣接間の関連による手法に比して、より適切な文書分割を行うことができる。
（２）目的に応じて文書を自由に切り出すことができる。
（３）簡単かつ容易に文書の関連度を評価することができる。
（４）再帰的に分割を行い階層的な分割が可能であるからより精度の高い文書の内部構造分析が可能である。
（５）文書間における文書の内容のまとまりの強弱を容易に把握することができる。
【図面の簡単な説明】
【図１】本発明を実施するための装置を概略的に示すブロック図である。
【符号の説明】
１…言語要素切出し手段、２…言語要素間関連度評価手段、３…言語要素間関連度行列取得手段、３ａ…言語要素間関連度行列、４…行列分割手段、Ｄ…電子化文書、ＤＤ…分割文書、ＬＥ…言語要素群。
[0001]
[Technical field of the invention]
The present invention relates to a document dividing device.
[0002]
[Prior art]
For example, when keyword extraction or the like is performed on a document in which several newspaper articles are combined, and the articles cover several fields, keywords from all of those fields are extracted mixed together. Likewise, in document retrieval, even when a document containing a specified search term is found, a large document must still be searched for the part most closely related to the search term; if units divided in advance can be made the search target, the relevant unit can be reached immediately.
Thus, if a single document can be divided into coherent units of content, various kinds of document processing become easier. In other words, dividing a document into coherent units of content makes useful document processing possible.
[0003]
The simplest methods of document division include a method that focuses on the rate of increase in the number of distinct words and recognizes the local minima of that rate as breaks [1], and a method that, instead of the rate of increase in distinct words, uses the sum, over a fixed window width at each position in the document, of the cohesion between semantically similar words [2]. There is also a method that focuses on chains of semantically related words (lexical cohesion) [3].
These methods focus on chains within a neighborhood and amount to bottom-up processing based on the relations between adjacent elements. Processing such as document division, however, requires top-down processing with a broad view.
[0004]
[Problems to be solved by the invention]
The present invention provides a document dividing device that takes into account not only the relations between adjacent elements but also wide-area relations.
[0005]
[Means for Solving the Problems]
The invention of claim 1 is a document dividing device comprising: language element extraction means for cutting out language elements from an electronic document and storing them in storage means; inter-language-element relevance matrix acquisition means for obtaining the relevance between all pairs of language elements stored in the storage means, as the ratio of the total of any one of the characters, words, or synonymous words common to the language elements, and storing it in inter-language-element relevance matrix storage means; and matrix dividing means for dividing the inter-language-element relevance matrix stored in the inter-language-element relevance matrix storage means into an optimal sequence of highly related sub-matrices, using as the evaluation value the ratio between the density of relevance of the elements inside each sub-matrix and the density of relevance of the elements outside each sub-matrix, and storing the result in sub-matrix storage means; wherein the document is divided hierarchically by recursively applying the matrix dividing means to the sub-matrices produced by the matrix dividing means.
[0006]
The invention of claim 2 is the document dividing device according to claim 1, in which the language element extraction means cuts out paragraphs, sentences, or lines as the language elements.
[0010]
The invention of claim 3 is the document dividing device according to claim 1 or 2, in which the evaluation value used by the matrix dividing means represents the degree of coherence of the content of the divided sub-matrices.
[0018]
[Embodiments of the invention]
FIG. 1 is a block diagram schematically showing the document dividing device of the present invention. In the figure, D denotes the electronic document to be divided by the device, and 1 denotes the language element extraction means, which cuts out a language element group LE from the electronic document. The mutual relevance of the cut-out language elements LE is evaluated by the inter-language-element relevance evaluation means 2, as described later, and the inter-language-element relevance matrix acquisition means 3 creates the inter-language-element relevance matrix 3a from those evaluation results. The matrix dividing means 4 divides the document according to relevance, based on the inter-language-element relevance matrix 3a.
[0019]
Hereinafter, the present invention will be described in detail with reference to examples.
First, the language element extraction means cuts the electronic document into a group of language elements. A language element here may be a line, a sentence, a paragraph, or the like in the document; no particular one of these is prescribed. Cutting into such units is straightforward: it can be done by a fixed number of characters, or at periods or line breaks.
[0020]
[Table 1]

[0021]
For example, if the document in Table 1 is cut at each line feed code, with a forced cut at the 40th character from the start of a line, the result is as shown in Table 2 (the numbers in [] are the segment numbers).
[0022]
[Table 2]

[0023]
If the document is instead cut at line feed codes or periods, the sentence segmentation shown in Table 3 is obtained.
[0024]
[Table 3]

[0025]
Furthermore, if it is cut out only by a line feed, it becomes as shown in Table 4.
[Table 4]
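The three ways of cutting described above (fixed width, sentence, and paragraph) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the default period character `。`, and the choice to drop empty lines are all assumptions made for the example.

```python
import re

def cut_lines(text, width=40):
    """Cut at every line feed, forcing a cut every `width` characters within a line."""
    pieces = []
    for line in text.split("\n"):
        for start in range(0, len(line), width):
            pieces.append(line[start:start + width])
    return pieces

def cut_sentences(text, period="。"):
    """Cut at every line feed or period; the period stays with its sentence."""
    pieces = []
    splitter = re.compile(f"(?<={re.escape(period)})")  # zero-width split after a period
    for line in text.split("\n"):
        pieces.extend(s for s in splitter.split(line) if s)
    return pieces

def cut_paragraphs(text):
    """Cut at line feeds only, so each non-empty line is one paragraph."""
    return [p for p in text.split("\n") if p]
```

Any of the three returned lists can serve as the language element group LE for the later steps.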

[0026]
Next, the inter-language-element relevance matrix acquisition means and the inter-language-element relevance evaluation means will be described.
From the cut-out language element group, the inter-language-element relevance matrix acquisition means obtains an inter-language-element relevance matrix made up of the relevance between every pair of language elements. The relevance between two language elements is computed by the inter-language-element relevance evaluation means. The simplest way to evaluate it is to take, out of the total number of characters in one language element, the proportion of characters that also occur in the other. For Japanese, kana characters, which carry no meaning in themselves, may be excluded from the count. Alternatively, a morphological analyzer that splits a sentence into morphemes (words) may be used, and the proportion of common words taken. Furthermore, a thesaurus or similar resource may be used so that lexical relatedness between different words also contributes to the relevance.
For example, take the second and fourth sentences of Table 3.
[0027]
[Table 5]

[0028]
The second sentence has 7 characters and the fourth has 41, and they share the 6 characters 「輸」, 「出」, 「規」, 「制」, 「が」, and 「始」. The relevance between the two sentences is therefore 6/7 = 0.857 seen from the second sentence and 6/41 = 0.146 seen from the fourth. This gives two relevance values; when the two sentences are considered together, 6 × 2 / (7 + 41) = 12/48 = 0.25 may be used as a single relevance value.
If hiragana and punctuation are excluded from the characters, the second sentence has 6 characters, the fourth has 26, and 5 characters are shared, which gives 5/6 = 0.833 and 5/26 = 0.192, or 5 × 2 / (6 + 26) = 0.3125.
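The character-based arithmetic above can be reproduced with a short sketch. The function names are illustrative, and `char_relevance` assumes that "common characters" means the distinct characters occurring in both strings, which the text does not spell out. The combined value 2c / (a + b) has the form of the Dice coefficient.

```python
def relevance(common, size_a, size_b):
    """Return (seen from A, seen from B, combined) relevance from raw counts."""
    return common / size_a, common / size_b, 2 * common / (size_a + size_b)

def char_relevance(a, b, ignore=""):
    """Character-based relevance of two non-empty strings.

    Characters listed in `ignore` (for example hiragana and punctuation) are skipped.
    Assumption: a common character is a distinct character found in both strings.
    """
    chars_a = [c for c in a if c not in ignore]
    chars_b = [c for c in b if c not in ignore]
    common = len(set(chars_a) & set(chars_b))
    return relevance(common, len(chars_a), len(chars_b))

# Counts quoted in the text: 6 common characters, 7 and 41 characters in total.
from_2, from_4, combined = relevance(6, 7, 41)
print(round(from_2, 3), round(from_4, 3), round(combined, 3))  # 0.857 0.146 0.25
```

Passing the kana and punctuation characters in `ignore` corresponds to the variant that counts 6, 26, and 5 characters.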
[0029]
When a morphological analyzer is used to split the sentences into words and the common nouns and sa-hen (suru) verbs are extracted, the result is
2: 輸出 (export), 規制 (regulation), 始動 (start)
4: 通常 (normal), 兵器 (weapon), 部品 (part), 加工 (processing), 機械 (machine), 転用 (diversion), 工業 (industry), 製品 (product), 輸出 (export), 規制 (regulation), 日本 (Japan)
The second sentence has 3 words, the fourth has 11, and the 2 words "export" and "regulation" are shared, so the relevance is 2/3 = 0.667 and 2/11 = 0.182, or 2 × 2 / (3 + 11) = 4/14 = 0.29.
Even when two words differ, if their semantic relatedness can be obtained from a thesaurus or similar resource, that value can be added to the number of common words in the calculation.
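A sketch of the word-based calculation, using the word lists given for the second and fourth sentences. The optional `related` table stands in for a thesaurus lookup; its structure, and the rule of adding the best score for each unmatched word, are assumptions made for illustration, since the text only says that relatedness can be added to the common-word count.

```python
def word_relevance(words_a, words_b, related=None):
    """Return (seen from A, seen from B, combined) relevance over distinct words.

    `related` is a hypothetical mapping frozenset({w, v}) -> score in [0, 1].
    """
    set_a, set_b = set(words_a), set(words_b)
    common = float(len(set_a & set_b))
    if related:
        only_b = set_b - set_a
        for w in set_a - set_b:
            # Assumed rule: credit each unmatched word with its best related score.
            common += max((related.get(frozenset((w, v)), 0.0) for v in only_b),
                          default=0.0)
    return common / len(set_a), common / len(set_b), 2 * common / (len(set_a) + len(set_b))

words_2 = ["輸出", "規制", "始動"]
words_4 = ["通常", "兵器", "部品", "加工", "機械", "転用",
           "工業", "製品", "輸出", "規制", "日本"]
from_2, from_4, combined = word_relevance(words_2, words_4)
print(round(from_2, 3), round(from_4, 3), round(combined, 2))  # 0.667 0.182 0.29
```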
Calculating the relevance between all language elements in this way yields the inter-language-element relevance matrix. For example, if the language elements are sentences and the relevance is the proportion of common words, an inter-sentence relevance matrix like the one in Table 6 is obtained.
[0030]
[Table 6]

[0031]
The matrix elements in Table 6 are represented as single-digit integers by multiplying the relevance by 10 and rounding. In addition, an element corresponding to the degree of relevance to itself is indicated by '*'.
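Building the full matrix and printing it in the compact notation described here might look like the sketch below. The `dice` helper and the sample word lists are illustrative assumptions; the use of '#' for a rounded value of 10 follows the convention this document states for Table 8.

```python
def dice(words_a, words_b):
    """Combined relevance 2c / (a + b) over distinct words (both lists non-empty)."""
    set_a, set_b = set(words_a), set(words_b)
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))

def relevance_matrix(elements, relevance=dice):
    """Relevance between every pair of language elements, as a list of rows."""
    n = len(elements)
    return [[relevance(elements[i], elements[j]) for j in range(n)] for i in range(n)]

def show(matrix):
    """Render each cell as one character: relevance x 10, rounded half up."""
    rows = []
    for i, row in enumerate(matrix):
        cells = []
        for j, value in enumerate(row):
            if i == j:
                cells.append("*")          # self-relevance
            else:
                digit = int(value * 10 + 0.5)
                cells.append("#" if digit >= 10 else str(digit))
        rows.append("".join(cells))
    return "\n".join(rows)

# Hypothetical language elements, already split into words.
elements = [["a", "b"], ["b", "c"], ["d"]]
print(show(relevance_matrix(elements)))
```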
[0032]
Next, the matrix dividing means will be described.
The matrix dividing means divides the inter-language-element relevance matrix obtained above into sub-matrices having a high relevance.
Table 7 shows an image in which the arrangement of sub-matrices with high relevance is extracted and divided from the inter-paragraph relevance matrix. In the table, 'H' indicates a high degree of association and 'L' indicates a low degree of association.
[0033]
[Table 7]

[0034]
When the matrix is divided into sub-matrices in this way, each sub-matrix corresponds to the partial document covering the same range.
The index for dividing the matrix is, for example, the ratio of (b) the average relevance outside the sub-matrices (the average of L), or (c) the average relevance over the whole matrix, to (a) the average relevance inside the sub-matrices (the average of H); the division that minimizes this value is computed. The computation can be carried out efficiently with well-known dynamic programming or similar methods.
If the inside/outside ratio is used, the evaluation index for Table 7 is
(sum of L / number of L elements) / (sum of H / number of H elements)
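The index and the search for its minimum can be sketched as follows. Two points are assumptions rather than statements from the text: diagonal elements are counted as part of H (with self-relevance 1.0 in the toy matrix), and the search is an exhaustive enumeration of contiguous divisions, used here as a simple stand-in for the dynamic programming the text mentions. The enumeration is exponential in the number of elements and only suitable for small matrices.

```python
from itertools import combinations

def evaluation(matrix, cuts):
    """(mean of L) / (mean of H) for the contiguous blocks delimited by `cuts`.

    `cuts` is a non-empty increasing tuple of indices where a new block starts.
    """
    n = len(matrix)
    bounds = [0, *cuts, n]
    block_of = [0] * n
    for b, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
        for i in range(lo, hi):
            block_of[i] = b
    h_sum = l_sum = 0.0
    h_count = l_count = 0
    for i in range(n):
        for j in range(n):
            if block_of[i] == block_of[j]:   # inside a sub-matrix: H
                h_sum += matrix[i][j]
                h_count += 1
            else:                            # outside every sub-matrix: L
                l_sum += matrix[i][j]
                l_count += 1
    return (l_sum / l_count) / (h_sum / h_count)

def best_division(matrix):
    """Try every division into two or more contiguous blocks; return the best cuts."""
    n = len(matrix)
    candidates = (cuts for k in range(1, n) for cuts in combinations(range(1, n), k))
    return min(candidates, key=lambda cuts: evaluation(matrix, cuts))

# Toy matrix: two groups of three elements, 0.8 within a group, 0.1 across groups.
group = [0, 0, 0, 1, 1, 1]
toy = [[1.0 if i == j else (0.8 if group[i] == group[j] else 0.1)
        for j in range(6)] for i in range(6)]
print(best_division(toy))  # (3,)
```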
[0035]
Table 8 is the inter-paragraph relevance matrix of a document made by concatenating ten newspaper articles (the element '#' stands for 10). When the optimal division is computed for the relevance matrix of Table 9 on the basis of the above index, the following sub-matrices are extracted:
Paragraphs 1 to 6
Paragraphs 7 to 13
Paragraphs 14 to 17
Paragraphs 18 to 27
Paragraphs 28 to 33
Paragraphs 34 to 36
Paragraphs 37 to 38
Paragraphs 39 to 55
Here,
sum of L / number of L elements = 165.37 / 2486 = 0.067
sum of H / number of H elements = 1345.62 / 539 = 2.497
so the evaluation value is 0.0266.
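The element counts and the evaluation value quoted above can be checked from the block boundaries alone. The sketch below assumes that H counts every element of each square sub-matrix, diagonal included; under that assumption the counts match the quoted 539 and 2486. The two sums are taken from the text as given.

```python
# Paragraph ranges of the eight sub-matrices (1-based, inclusive).
blocks = [(1, 6), (7, 13), (14, 17), (18, 27), (28, 33), (34, 36), (37, 38), (39, 55)]

sizes = [last - first + 1 for first, last in blocks]
n = sum(sizes)                          # total number of paragraphs
h_count = sum(s * s for s in sizes)     # elements inside the sub-matrices
l_count = n * n - h_count               # elements outside them

l_mean = 165.37 / l_count               # sum of L as quoted in the text
h_mean = 1345.62 / h_count              # sum of H as quoted in the text

print(n, h_count, l_count)                                             # 55 539 2486
print(round(l_mean, 3), round(h_mean, 3), round(l_mean / h_mean, 4))   # 0.067 2.497 0.0266
```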
[0036]
[Table 8]

[0037]
[Table 9]

[0038]
Applying the division recursively to a sub-matrix yields a further division of its interior. For example, the sub-matrix for paragraphs 18 to 27 has an evaluation value of 0.237 and is divided into paragraphs 18 to 21 and paragraphs 22 to 27 (Table 10).
[0039]
[Table 10]

[0040]
The sub-matrix for paragraphs 39 to 55 has an evaluation value of 0.681 and is divided into paragraphs 39 to 40, paragraph 41, paragraph 42, paragraphs 43 to 48, paragraph 49, paragraphs 50 to 52, and paragraphs 53 to 55 (Table 11).
[0041]
[Table 11]

[0042]
The evaluation values of the divisions in Tables 9, 10, and 11 are 0.0266, 0.237, and 0.681, respectively, and as the value grows the content is less sharply divided, which shows that the value can be used as an index of the division. That is, the smaller the index, the stronger the coherence of the content, and the larger the index, the weaker the coherence.
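Recursive, hierarchical division can be sketched by reapplying the same search to each sub-matrix. The text gives no stopping rule, so the `max_value` threshold below is purely an illustrative assumption that uses the evaluation value as a coherence index; the exhaustive search and the inclusion of the diagonal in H are likewise assumptions, and the function names are hypothetical.

```python
from itertools import combinations

def evaluation(matrix, cuts):
    """(mean outside the blocks) / (mean inside the blocks) for the given cuts."""
    n = len(matrix)
    bounds = [0, *cuts, n]
    inside = {(i, j) for lo, hi in zip(bounds, bounds[1:])
              for i in range(lo, hi) for j in range(lo, hi)}
    h = [matrix[i][j] for i, j in inside]
    l = [matrix[i][j] for i in range(n) for j in range(n) if (i, j) not in inside]
    return (sum(l) / len(l)) / (sum(h) / len(h))

def best_division(matrix):
    """Exhaustive search over contiguous divisions; returns (cuts, evaluation value)."""
    n = len(matrix)
    candidates = (cuts for k in range(1, n) for cuts in combinations(range(1, n), k))
    cuts = min(candidates, key=lambda c: evaluation(matrix, c))
    return cuts, evaluation(matrix, cuts)

def hierarchy(matrix, offset=0, max_value=0.5):
    """Return nested ((start, end), children) ranges, end exclusive, 0-based."""
    n = len(matrix)
    if n < 2:
        return []
    cuts, value = best_division(matrix)
    if value > max_value:        # assumed rule: too weakly separated, stop here
        return []
    bounds = [0, *cuts, n]
    result = []
    for lo, hi in zip(bounds, bounds[1:]):
        sub = [row[lo:hi] for row in matrix[lo:hi]]
        result.append(((offset + lo, offset + hi),
                       hierarchy(sub, offset + lo, max_value)))
    return result

# Toy matrix: elements 0-1 and 2-3 form two groups (0.8 within, 0.1 across).
toy = [[1.0, 0.8, 0.1, 0.1],
       [0.8, 1.0, 0.1, 0.1],
       [0.1, 0.1, 1.0, 0.8],
       [0.1, 0.1, 0.8, 1.0]]
print(hierarchy(toy))  # [((0, 2), []), ((2, 4), [])]
```

Raising `max_value` lets the recursion continue into weakly separated sub-matrices, which mirrors how the divisions with larger evaluation values in the text are still reported.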
[0043]
A document divided by the document dividing device of the present invention can be used by: document display means that displays the document with its divisions distinguished; document processing means that processes the document one partial document at a time; document abstracting means that displays an abstract using key-sentence extraction from the document; document management means that manages the document by partial document; and document retrieval means that manages the divided units as retrieval targets.
[0044]
[Effects of the invention]
According to the present invention, the following effects can be obtained.
(1) Since the document division method takes into account the wide-area relation between documents, it is possible to perform more appropriate document division than the conventional method based on the relation between adjacent elements.
(2) Documents can be cut out freely according to the purpose.
(3) The relevance within a document can be evaluated simply and easily.
(4) Because the division is applied recursively and hierarchical division is possible, the internal structure of a document can be analyzed with higher accuracy.
(5) The strength or weakness of the coherence of the content of the divided documents can be grasped easily.
[Brief description of the drawings]
FIG. 1 is a block diagram schematically showing an apparatus for implementing the present invention.
[Explanation of symbols]
1 ... language element extraction means, 2 ... inter-language-element relevance evaluation means, 3 ... inter-language-element relevance matrix acquisition means, 3a ... inter-language-element relevance matrix, 4 ... matrix dividing means, D ... electronic document, DD ... divided document, LE ... language element group.

Claims

1. A document dividing device comprising: language element extraction means for cutting out language elements from an electronic document and storing them in storage means; inter-language-element relevance matrix acquisition means for obtaining the relevance between all pairs of language elements stored in the storage means, as the ratio of the total of any one of the characters, words, or synonymous words common to the language elements, and storing it in inter-language-element relevance matrix storage means; and matrix dividing means for dividing the inter-language-element relevance matrix stored in the inter-language-element relevance matrix storage means into an optimal sequence of highly related sub-matrices, using as the evaluation value the ratio between the density of relevance of the elements inside each sub-matrix and the density of relevance of the elements outside each sub-matrix, and storing the result in sub-matrix storage means; characterized in that the document is divided hierarchically by recursively applying the matrix dividing means to the sub-matrices produced by the matrix dividing means.

2. The document dividing apparatus according to claim 1, wherein the language element extracting unit extracts any one of a paragraph, a sentence, and a line as a language element.

3. The document dividing device according to claim 1 or 2, wherein the evaluation value used by the matrix dividing means represents the degree of coherence of the content of the divided sub-matrices.