JP3578618B2 - Document splitting device - Google Patents

Document splitting device

Info

Publication number
JP3578618B2
JP3578618B2 JP04472198A JP4472198A JP3578618B2 JP 3578618 B2 JP3578618 B2 JP 3578618B2 JP 04472198 A JP04472198 A JP 04472198A JP 4472198 A JP4472198 A JP 4472198A JP 3578618 B2 JP3578618 B2 JP 3578618B2
Authority
JP
Japan
Prior art keywords
matrix
document
language
relevance
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP04472198A
Other languages
Japanese (ja)
Other versions
JPH11242684A (en)
Inventor
雅之 亀田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd
Priority to JP04472198A
Publication of JPH11242684A
Application granted
Publication of JP3578618B2
Anticipated expiration
Expired - Fee Related

Description

【0001】
【発明の属する技術分野】
本発明は、文書分割装置に関するものである。
【0002】
【従来の技術】
例えば、複数の新聞記事が一緒になっている文書に対して、キーワード抽出等を行うと、新聞記事がいくつかの分野を含んでいると様々なキーワードが混ざり合って抽出されてしまう。また、文書検索において、指定された検索語を含む文書を検索しても、大きな文書であると、検索語に関連の深い部分を探すことが必要になるが、予め分割した単位を対象とすることができれば、直ちにその分割単位に辿り着くことができる。
このように、一つの文書を内容のまとまり毎に分割することができると、様々な文書処理が容易になる、即ち、文書を内容のまとまり毎に分割することにより、有用な文書処理を実現することができる。
【0003】
文書分割の最も単純な方法としては、異なり語数の増加率に着目して、増加率の極小値を切れ目として認識する方法[1]、異なり語数の増加率の代わりに、意味レベルの単語の類似単語の結束度を文書上の位置の一定の窓幅での総和を用いる方法[2]がある。また、意味的に関連性のある語の連鎖(語彙的結束性)に着目する方法[3]がある。
これらの方法は、近傍の連鎖に着目しており、隣接間の関連に着目したボトムアップ的な視点での処理となっているが、文書分割のような処理は、トップダウン的な広い視野の処理が必要である。
【0004】
【発明が解決しようとする課題】
本発明では、隣接間の関連だけでなく、広域的な関連も考慮に入れた文書分割装置を提供するものである。
【0005】
【課題を解決するための手段】
請求項1の発明は、電子化された文書から言語要素を切り出して記憶手段に記憶する言語要素切り出し手段と、前記記憶手段に記憶された全ての言語要素同士の関連度を、言語要素内の共通の文字、単語、類義単語のいずれか毎の総和の割合により求めて言語要素間関連度行列記憶手段へ記憶する言語要素間関連度行列取得手段と、前記言語要素間関連度行列記憶手段へ記憶された言語要素間関連度行列を、各部分行列の内の要素の関連度の密度と各部分行列の外の要素の関連度の密度との比を評価値とし、該評価値を用いて最適な関連性の高い部分行列の並びに分割して部分行列記憶手段へ記憶する行列分割手段とを有し、前記行列分割手段により分割された部分行列に対して、再帰的に前記行列分割手段を用いることにより、階層的に文書を分割する文書分割装置である。
【0006】
請求項2の発明は、請求項1に記載された文書分割装置において、前記言語要素切り出し手段は、段落、文、行のいずれかを言語要素として切り出す文書分割装置である。
【0010】
請求項の発明は、請求項1または2に記載された文書分割装置において、前記行列分割手段用いる評価値は、分割された部分行列の内容のまとまりの程度を表す文書分割装置である。
【0018】
【発明の実施の形態】
図1は、本発明の文書分割装置を概略的に示したブロック図であって、図中、Dは本発明の文書分割装置による分割の対象となる電子化文書、1は言語要素切出し手段であって電子化文書から言語要素群LEを切り出す。切り出された言語要素群LEは、後述のように言語要素間関連度評価手段2によって相互の関連度が評価され、言語要素間関連度行列取得手段3は前記言語要素間関連度評価手段2の評価結果に基づいて言語要素間関連度行列3aを作成する。行列分割手段4は前記言語要素間関連度行列3aから関連度に応じて文書を分割する。
【0019】
以下、本発明をその実施例について詳しく説明する。
まず、電子化された文書は、言語要素切り出し手段により、言語要素群に切り出される。この際の言語要素とは、文書における行であったり、文であったり、段落などであり、特に、いずれかを特定するものではない。これらの言語要素を単位とする切り出しは、一定文字数によったり、句点や改行で容易に切り出すことができる。
【0020】
【表1】

Figure 0003578618
【0021】
例えば、表1の文書を改行コードあるいは行先頭から40文字目で強制的に切り出せば、表2のようになる([ ]内は、切り出し番号である)。
【0022】
【表2】
Figure 0003578618
【0023】
また、改行コードあるいは句点で切り出せば、表3のような文の切り出しを得る。
【0024】
【表3】
Figure 0003578618
【0025】
さらに、改行だけで切り出せば、表4のようになる。
【表4】
Figure 0003578618
【0026】
次に、言語要素間関連度行列取得手段及び言語要素間関連度評価手段について説明する。
切り出された言語要素群から、言語要素間関連度行列取得手段により、全ての言語要素間の関連度から成る言語要素間関連度行列が得られる。この際、2つの言語要素間の関連度は、言語要素間関連度評価手段により求められる。言語要素間の関連度の評価法は、例えば、最も簡単には、一方の言語要素内の文字の総数のうち、もう一方の言語要素内の文字と共通の文字数の割合をみればよい。あるいは、日本語の場合であれば、文字のうち文字自体に意味を持たない「かな文字」を除いた文字とするのもよい。また、文を文形態素(単語)に分割する形態素解析系を用いて共通単語の割合を用いてもよい。さらに、シソーラス辞書等を用いて異なる単語でも語彙的な関連を考慮して関連度を求めてもよい。
例えば、表3の第2文と第4文
【0027】
【表5】
Figure 0003578618
【0028】
この2文の文字総数は、第2文は7文字、第4文は41文字であり、共通文字は、「輸」、「出」、「規」、「制」、「が」、「始」、の6文字であり、第2文と第4文の関連度は、第2文から見た場合は、6/7=0.857、第4文から見た場合は、6/41=0.146となる。この場合は、2つの関連度が得られるが、両文を全体から見た場合は6×2/(7+41)=12/48=0.25を一つの関連度として用いてもよい。
また、文字から平仮名と句読点を除くと、第2文は6文字、第4文は26文字、共通文字は5文字であるから、5/6=0.833と5/26=0.192、あるいは5×2/(6+26)=0.3125が得られる。
【0029】
また、形態素解析系を用いて単語分割し、一般名詞やサ変動詞を切り出すと、
2:輸出、規制、始動
4:通常、兵器、部品、加工、機械、転用、工業、製品、輸出、規制、日本
となり、第2文が3単語、第4文が11単語、共通単語が「輸出」と「規制」の2単語なので、2/3=0.667と2/11=0.182、あるいは2×2/(3+11)=4/14=0.29が得られる。
さらに異なる単語であっても、2単語間の意味的な関連度がシソーラス辞書等を用いて得ることができれば、その関連度を共通単語数に加えて計算することができる。
このようにして、全ての言語要素間の関連度を計算することによって、言語要素間関連度行列を得ることができる。例えば、言語要素を文として、関連度を共通単語の割合によるとすれば、表6のような文間関連度マトリックスを得ることができる。
【0030】
【表6】
Figure 0003578618
【0031】
なお、表6の行列要素は、関連度を10倍して四捨五入により、1桁の整数で表示してある。また、自分自身との関連度に当たる要素は‘*’で示している。
【0032】
次に、行列分割手段について説明する。
以上で得られた言語要素間関連度行列を行列分割手段が関連度の高い部分行列に分割する。
表7は段落間関連度行列から関連度の高い部分行列の並びを抽出・分割したイメージを示したものである。表中、‘H’は高い関連度、‘L’は低い関連度を示す。
【0033】
【表7】
Figure 0003578618
【0034】
上記のような部分行列に分割された場合は、分割された部分行列の範囲に応じて部分文書が対応する。
この行列の分割の指標としては、(a)部分行列の内の関連度の平均値(Hの平均値)に対する、(b)部分行列外の関連度の平均値(Lの平均値)あるいは(c)行列全体の関連度の平均値の比、等を用いてその値が最小になるような分割を計算する。この計算はよく知られた動的計画法等を用いれば効率的に行うことができる。
部分行列の内外の比を用いるとすると、表7では、
(Lの総和/Lの要素数)/( Hの総和/Hの要素数)
が評価指標となる。
【0035】
表8は、10の新聞記事を連結した文書の段落間関連度行列である(要素#は、10の意味である)。表9の関連度行列に対して、前記の指標に基づき、最適な分割を計算すると、次のような部分行列が抽出・分割される。
第 1段落〜第 6段落
第 7段落〜第13段落
第14段落〜第17段落
第18段落〜第27段落
第28段落〜第33段落
第34段落〜第36段落
第37段落〜第38段落
第39段落〜第55段落
ここで、Lの総和/Lの要素数=165.37/2486=0.067
Hの総和/Hの要素数=1345.62/539=2.497
から、評価値は、0.0266である。
【0036】
【表8】
Figure 0003578618
【0037】
【表9】
Figure 0003578618
【0038】
また、部分行列に対し、再帰的に分割を施すとさらに内部の分割を得ることができる。例えば、第18段落から第27段落の部分行列に対しては、評価値は0.237であって、第18段落〜第21段落及び第22段落〜第27段落の分割が得られる(表10)。
【0039】
【表10】
Figure 0003578618
【0040】
第39段落から第55段落の部分行列に対しては、評価値は0.681で
第39段落〜第40段落、第41段落、第42段落、第43段落〜第48段落、第49段落、第50段落〜第52段落、及び第53段落から第55段落、の分割が得られる(表11)。
【0041】
【表11】
Figure 0003578618
【0042】
表9,表10,表11の分割の評価値が各々0.0266,0.237,0.681と大きくなるにつれて、内容の分割の程度が緩くなっていることが分かるように、分割の指標として用いることができる。即ち、この指標が小さい程、内容のまとまりが強く、大きい程まとまりは弱いことを示唆している。
【0043】
また、本発明の文書分割装置によって分割された文書は、その分割に応じて文書を識別して表示する文書表示手段、前記分割された部分文書ごとに文書を処理する文書処理手段、文書からのキーセンテンス抽出処理を用いて抄録表示を行う文書抄録手段、前記分割された部分文書ごとに文書を管理する文書管理手段、及び、文書を分割された単位で検索対象として管理する文書検索手段において利用される。
【0044】
【発明の効果】
本発明によれば、次のような効果を奏する。
(1)文書間の広域的な関連をも考慮にいれた文書分割手法であるため、従来の隣接間の関連による手法に比して、より適切な文書分割を行うことができる。
(2)目的に応じて文書を自由に切り出すことができる。
(3)簡単かつ容易に文書の関連度を評価することができる。
(4)再帰的に分割を行い階層的な分割が可能であるからより精度の高い文書の内部構造分析が可能である。
(5)文書間における文書の内容のまとまりの強弱を容易に把握することができる。
【図面の簡単な説明】
【図1】本発明を実施するための装置を概略的に示すブロック図である。
【符号の説明】
1…言語要素切出し手段、2…言語要素間関連度評価手段、3…言語要素間関連度行列取得手段、3a…言語要素間関連度行列、4…行列分割手段、D…電子化文書、DD…分割文書、LE…言語要素群。
[0001]
[Technical field of the invention]
The present invention relates to a document dividing apparatus.
[0002]
[Prior art]
For example, when keyword extraction or the like is performed on a document in which several newspaper articles are combined, and those articles cover several fields, a mixture of unrelated keywords is extracted. Likewise, in document retrieval, even when a document containing the specified search term is found, a large document still has to be searched for the part closely related to that term; if units divided in advance can be used as the search targets, the relevant unit can be reached directly.
Thus, if a single document can be divided into units of coherent content, various kinds of document processing become easier. In other words, dividing a document by content unit makes useful document processing possible.
[0003]
The simplest methods of document division include a method [1] that focuses on the rate of increase in the number of distinct words and treats local minima of that rate as breaks, and a method [2] that, instead of the rate of increase in distinct words, uses the cohesion between semantically similar words, summed over a fixed window width at each position in the document. There is also a method [3] that focuses on chains of semantically related words (lexical cohesion).
These methods focus on chains among neighbors and work from a bottom-up viewpoint centered on the relation between adjacent elements. Processing such as document division, however, requires a top-down approach with a wide field of view.
[0004]
[Problems to be solved by the invention]
The present invention provides a document dividing apparatus that takes into account not only the relation between adjacent elements but also wide-range relations.
[0005]
[Means for Solving the Problems]
The invention of claim 1 is a document dividing apparatus comprising: language element extraction means for cutting out language elements from an electronic document and storing them in storage means; inter-language-element relevance matrix acquisition means for obtaining the relevance between all pairs of language elements stored in the storage means, from the proportion of the total of characters, words, or synonymous words that the language elements have in common, and storing it in inter-language-element relevance matrix storage means; and matrix dividing means for dividing the inter-language-element relevance matrix stored in the inter-language-element relevance matrix storage means into an optimal sequence of highly related sub-matrices, using as an evaluation value the ratio between the density of relevance of the elements inside the sub-matrices and the density of relevance of the elements outside them, and storing the result in sub-matrix storage means; wherein the matrix dividing means is applied recursively to the sub-matrices it has produced, so that the document is divided hierarchically.
[0006]
The invention of claim 2 is the document dividing apparatus according to claim 1, wherein the language element extraction means cuts out a paragraph, a sentence, or a line as a language element.
[0010]
The invention of claim 3 is the document dividing apparatus according to claim 1 or 2, wherein the evaluation value used by the matrix dividing means represents the degree of cohesion of the content of the divided sub-matrices.
[0018]
[Embodiments of the invention]
FIG. 1 is a block diagram schematically showing the document dividing apparatus of the present invention. In the figure, D denotes the electronic document to be divided by the apparatus, and 1 denotes the language element extraction means, which cuts out a language element group LE from the electronic document. The mutual relevance of the cut-out language elements LE is evaluated by the inter-language-element relevance evaluation means 2, as described later, and the inter-language-element relevance matrix acquisition means 3 creates the inter-language-element relevance matrix 3a from those evaluation results. The matrix dividing means 4 divides the document according to relevance, using the inter-language-element relevance matrix 3a.
[0019]
Hereinafter, the present invention will be described in detail with reference to examples.
First, the language element extraction means cuts the electronic document into a group of language elements. A language element here may be a line, a sentence, a paragraph, or the like; no particular unit is prescribed. Such elements are easily cut out by a fixed number of characters, or at periods or line breaks.
[0020]
[Table 1]
Figure 0003578618
[0021]
For example, if the document in Table 1 is cut at each line feed code, or forcibly at the 40th character from the start of a line, the result is as shown in Table 2 (the numbers in [ ] are cut-out numbers).
[0022]
[Table 2]
Figure 0003578618
[0023]
Also, if the clipping is performed at a line feed code or a period, a clipping of a sentence as shown in Table 3 is obtained.
[0024]
[Table 3]
Figure 0003578618
[0025]
Furthermore, if it is cut out only by a line feed, it becomes as shown in Table 4.
[Table 4]
Figure 0003578618
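The three cutting rules of paragraphs [0021] to [0025] can be sketched as follows. This is only an illustration under stated assumptions, not the apparatus itself: the function names `cut_lines`, `cut_sentences`, and `cut_paragraphs` and the sample text are hypothetical, and only the Japanese full stop `。` is treated as a period.

```python
import re

def cut_lines(text, width=40):
    """Cut at every line feed, and forcibly after `width` characters of a line."""
    elements = []
    for line in text.splitlines():
        elements.extend(line[i:i + width] for i in range(0, len(line), width))
    return [e for e in elements if e]

def cut_sentences(text):
    """Cut at every line feed or full stop, keeping the full stop with its sentence."""
    return re.findall(r"[^。\n]+。?", text)

def cut_paragraphs(text):
    """Cut at line feeds only."""
    return [line for line in text.splitlines() if line.strip()]

# Hypothetical sample text (not the document of Table 1).
sample = "AAA。BBB\nCCC。"
print(cut_lines(sample, width=4))
print(cut_sentences(sample))
print(cut_paragraphs(sample))
```

Whichever rule is chosen, the result is an ordered list of language elements, which is all the later relevance and division steps need.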
[0026]
Next, the inter-language-element relevance matrix acquisition means and the inter-language-element relevance evaluation means will be described.
From the cut-out language element group, the inter-language-element relevance matrix acquisition means obtains an inter-language-element relevance matrix consisting of the relevance between all pairs of language elements. The relevance between two language elements is obtained by the inter-language-element relevance evaluation means. The simplest way to evaluate it is to take the proportion of the characters in one language element that also appear in the other. For Japanese, kana characters, which carry no meaning by themselves, may be excluded from the count. Alternatively, a morphological analyzer that splits sentences into morphemes (words) may be used and the proportion of shared words taken. Further, a thesaurus or similar resource may be used so that lexical relations between different words also contribute to the relevance.
For example, consider the second and fourth sentences of Table 3.
[Table 5]
Figure 0003578618
[0028]
The second sentence has 7 characters in total and the fourth has 41, and they share the 6 characters 「輸」, 「出」, 「規」, 「制」, 「が」, and 「始」. The relevance between the two is therefore 6/7 = 0.857 when seen from the second sentence and 6/41 = 0.146 when seen from the fourth. This yields two relevance values; if both sentences are viewed as a whole, 6 × 2 / (7 + 41) = 12/48 = 0.25 may be used as a single relevance value.
If hiragana and punctuation are excluded, the second sentence has 6 characters, the fourth has 26, and 5 characters are shared, giving 5/6 = 0.833 and 5/26 = 0.192, or 5 × 2 / (6 + 26) = 0.3125.
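The arithmetic of paragraph [0028] can be checked with a short sketch. The helper names `relevance_from_counts` and `content_chars` are hypothetical. Table 5 is not reproduced in this text, so the check starts from the published counts rather than from the sentences themselves; the short string passed to `content_chars` is a made-up sample.

```python
import unicodedata

def relevance_from_counts(common, total_a, total_b):
    """Return (relevance seen from A, relevance seen from B, combined relevance)."""
    return (common / total_a,
            common / total_b,
            2 * common / (total_a + total_b))

def content_chars(text):
    """Drop hiragana, punctuation, and whitespace; katakana is kept (an assumption)."""
    kept = []
    for ch in text:
        is_hiragana = "\u3040" <= ch <= "\u309f"
        is_punct_or_space = unicodedata.category(ch)[0] in ("P", "Z") or ch.isspace()
        if not (is_hiragana or is_punct_or_space):
            kept.append(ch)
    return kept

all_chars = relevance_from_counts(6, 7, 41)   # all characters counted
no_kana = relevance_from_counts(5, 6, 26)     # hiragana and punctuation excluded
print([round(v, 3) for v in all_chars])       # [0.857, 0.146, 0.25]
print([round(v, 4) for v in no_kana])         # [0.8333, 0.1923, 0.3125]
print(content_chars("かな、漢字。"))            # ['漢', '字']
```

The combined value weights both sentences by length, so it always lies between the two one-sided values.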
[0029]
Alternatively, if the sentences are split into words with a morphological analyzer and the common nouns and sa-hen (suru) verbal nouns are extracted, the result is
2: 輸出 (export), 規制 (regulation), 始動 (start)
4: 通常 (ordinary), 兵器 (weapon), 部品 (parts), 加工 (processing), 機械 (machinery), 転用 (diversion), 工業 (industry), 製品 (product), 輸出 (export), 規制 (regulation), 日本 (Japan)
The second sentence has 3 words, the fourth has 11, and the two words "export" and "regulation" are shared, so 2/3 = 0.667 and 2/11 = 0.182, or 2 × 2 / (3 + 11) = 4/14 = 0.29, are obtained.
Furthermore, even for different words, if the semantic relevance between the two words can be obtained from a thesaurus or the like, that relevance can be added to the number of common words in the calculation.
In this way, by calculating the relevance between all language elements, a relevance matrix between language elements can be obtained. For example, if the linguistic element is a sentence and the degree of relevancy depends on the ratio of common words, an inter-sentence relevance matrix as shown in Table 6 can be obtained.
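The word-based relevance of paragraph [0029] can be reproduced with the two word lists given there. The function names `directed_relevance` and `symmetric_relevance` are hypothetical, and repeated words within a sentence are counted once here, which the source does not specify. The thesaurus-based extension is not covered.

```python
def directed_relevance(a, b):
    """Share of the distinct items of `a` that also occur in `b`."""
    items_a, items_b = set(a), set(b)
    return len(items_a & items_b) / len(items_a) if items_a else 0.0

def symmetric_relevance(a, b):
    """2 x (shared items) / (items in a + items in b)."""
    items_a, items_b = set(a), set(b)
    total = len(items_a) + len(items_b)
    return 2 * len(items_a & items_b) / total if total else 0.0

# Word lists for the second and fourth sentences, as listed in [0029].
sentence_2 = ["輸出", "規制", "始動"]
sentence_4 = ["通常", "兵器", "部品", "加工", "機械", "転用",
              "工業", "製品", "輸出", "規制", "日本"]

print(round(directed_relevance(sentence_2, sentence_4), 3))   # 0.667
print(round(directed_relevance(sentence_4, sentence_2), 3))   # 0.182
print(round(symmetric_relevance(sentence_2, sentence_4), 2))  # 0.29
```

The same two functions work unchanged on lists of characters, so one implementation can serve both the character-based and the word-based variants.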
[0030]
[Table 6]
Figure 0003578618
[0031]
The matrix elements in Table 6 are represented as single-digit integers by multiplying the relevance by 10 and rounding. In addition, an element corresponding to the degree of relevance to itself is indicated by '*'.
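A relevance matrix and its one-character-per-cell display can be sketched as below. The names `relevance_matrix` and `render` are hypothetical, the symmetric word-based relevance is only one of the options the description allows, and the third element is a made-up sample. Scaled values are rounded half up, and `#` stands for 10 as in the note on Table 8.

```python
def symmetric_relevance(a, b):
    """2 x (shared items) / (items in a + items in b), duplicates ignored."""
    items_a, items_b = set(a), set(b)
    total = len(items_a) + len(items_b)
    return 2 * len(items_a & items_b) / total if total else 0.0

def relevance_matrix(elements):
    """Relevance between every pair of language elements."""
    return [[symmetric_relevance(a, b) for b in elements] for a in elements]

def render(matrix):
    """One character per cell: '*' on the diagonal, else the relevance x 10, rounded."""
    rows = []
    for i, row in enumerate(matrix):
        cells = []
        for j, value in enumerate(row):
            if i == j:
                cells.append("*")
            else:
                digit = int(10 * value + 0.5)  # round half up
                cells.append("#" if digit >= 10 else str(digit))
        rows.append("".join(cells))
    return rows

elements = [
    ["輸出", "規制", "始動"],
    ["通常", "兵器", "部品", "加工", "機械", "転用",
     "工業", "製品", "輸出", "規制", "日本"],
    ["日本", "工業"],  # hypothetical third element, not from the source
]
rendered = render(relevance_matrix(elements))
print("\n".join(rendered))
```

With these inputs the off-diagonal values are 4/14, 0, and 4/13, which display as 3, 0, and 3.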
[0032]
Next, the matrix dividing means will be described.
The matrix dividing means divides the inter-language-element relevance matrix obtained above into sub-matrices having a high relevance.
Table 7 shows an image in which the arrangement of sub-matrices with high relevance is extracted and divided from the inter-paragraph relevance matrix. In the table, 'H' indicates a high degree of association and 'L' indicates a low degree of association.
[0033]
[Table 7]
Figure 0003578618
[0034]
When the matrix is divided into sub-matrices as above, each partial document corresponds to the range of its sub-matrix.
As the index for this division, one may use, for example, the ratio of (b) the average relevance outside the sub-matrices (the average of L), or of (c) the average relevance of the whole matrix, to (a) the average relevance inside the sub-matrices (the average of H), and compute the division that minimizes this value. The computation can be carried out efficiently with well-known techniques such as dynamic programming.
If the ratio of outside to inside is used, the evaluation index for Table 7 is
(sum of L / number of L elements) / (sum of H / number of H elements)
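The index can be sketched as a function of a relevance matrix and a set of cut positions. Everything named here (`evaluation_value`, `best_division`, the `toy` matrix) is hypothetical. Two assumptions are made: diagonal cells count as inside a block, and the search is a plain exhaustive enumeration that stands in for the dynamic programming the description mentions, so it is only practical for small matrices.

```python
from itertools import combinations

def evaluation_value(matrix, cuts):
    """(mean relevance outside the blocks) / (mean relevance inside the blocks).

    `cuts` holds the start index of every block after the first, so cuts=(2,)
    on a 4x4 matrix means the blocks [0, 2) and [2, 4).
    """
    n = len(matrix)
    bounds = [0, *cuts, n]
    block = [k for k, (a, b) in enumerate(zip(bounds, bounds[1:])) for _ in range(a, b)]
    inside = [matrix[i][j] for i in range(n) for j in range(n) if block[i] == block[j]]
    outside = [matrix[i][j] for i in range(n) for j in range(n) if block[i] != block[j]]
    if not outside or sum(inside) == 0:
        return float("inf")  # ratio undefined: never preferred
    return (sum(outside) / len(outside)) / (sum(inside) / len(inside))

def best_division(matrix):
    """Try every division into two or more contiguous blocks; keep the minimum."""
    n = len(matrix)
    candidates = (c for k in range(1, n) for c in combinations(range(1, n), k))
    return min(candidates, key=lambda c: evaluation_value(matrix, c), default=())

# Hypothetical 4x4 relevance matrix with two obvious groups.
toy = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
print(best_division(toy))                       # (2,)
print(round(evaluation_value(toy, (2,)), 3))    # 0.054
```

Because the index is a ratio of two averages, a lower value means the blocks capture more of the total relevance per cell than the area outside them.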
[0035]
Table 8 is the inter-paragraph relevance matrix of a document formed by concatenating ten newspaper articles (an element shown as # means 10). When the optimal division is computed for the relevance matrix of Table 9 using the above index, the following sub-matrices are extracted:
Paragraphs 1 to 6
Paragraphs 7 to 13
Paragraphs 14 to 17
Paragraphs 18 to 27
Paragraphs 28 to 33
Paragraphs 34 to 36
Paragraphs 37 to 38
Paragraphs 39 to 55
Here, sum of L / number of L elements = 165.37 / 2486 = 0.067
and sum of H / number of H elements = 1345.62 / 539 = 2.497,
so the evaluation value is 0.0266.
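The element counts in paragraph [0035] can be rechecked from the eight paragraph ranges alone. The sketch below assumes that the diagonal cells are counted among the H elements, which is what the published count of 539 implies; the two sums 165.37 and 1345.62 are taken from the text as given.

```python
# First and last paragraph of each extracted sub-matrix (1-based, inclusive).
segments = [(1, 6), (7, 13), (14, 17), (18, 27),
            (28, 33), (34, 36), (37, 38), (39, 55)]

sizes = [last - first + 1 for first, last in segments]
n = sum(sizes)                               # 55 paragraphs
inside_count = sum(s * s for s in sizes)     # 539 H elements
outside_count = n * n - inside_count         # 2486 L elements

value = (165.37 / outside_count) / (1345.62 / inside_count)
print(n, inside_count, outside_count, round(value, 4))  # 55 539 2486 0.0266
```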
[0036]
[Table 8]
Figure 0003578618
[0037]
[Table 9]
Figure 0003578618
[0038]
Further, applying the division recursively to a sub-matrix yields a finer division inside it. For example, for the sub-matrix covering paragraphs 18 to 27, the evaluation value is 0.237 and the division into paragraphs 18 to 21 and paragraphs 22 to 27 is obtained (Table 10).
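Recursive division can be sketched by applying the same search to each sub-matrix. All names below are hypothetical, and the stopping rule `max_value` is an assumption added so the recursion ends; the source gives no such rule. The scoring and exhaustive search are restated compactly so the block runs on its own.

```python
from itertools import combinations

def evaluation_value(m, cuts):
    """(mean relevance outside the blocks) / (mean relevance inside the blocks)."""
    n = len(m)
    bounds = [0, *cuts, n]
    block = [k for k, (a, b) in enumerate(zip(bounds, bounds[1:])) for _ in range(a, b)]
    inside = [m[i][j] for i in range(n) for j in range(n) if block[i] == block[j]]
    outside = [m[i][j] for i in range(n) for j in range(n) if block[i] != block[j]]
    if not outside or sum(inside) == 0:
        return float("inf")
    return (sum(outside) / len(outside)) / (sum(inside) / len(inside))

def best_division(m):
    """Exhaustive search; a stand-in for the dynamic programming mentioned in the text."""
    n = len(m)
    candidates = (c for k in range(1, n) for c in combinations(range(1, n), k))
    return min(candidates, key=lambda c: evaluation_value(m, c), default=())

def split_tree(m, offset=0, max_value=0.5):
    """Return nested lists of (start, end) ranges, zero-based and end-exclusive."""
    n = len(m)
    cuts = best_division(m)
    if not cuts or evaluation_value(m, cuts) > max_value:
        return (offset, offset + n)  # leaf: not divided further
    bounds = [0, *cuts, n]
    return [split_tree([row[a:b] for row in m[a:b]], offset + a, max_value)
            for a, b in zip(bounds, bounds[1:])]

# Hypothetical matrix: elements 0-3 form one group with two subgroups, 4-5 another.
toy = [
    [1.0, 0.9, 0.4, 0.4, 0.0, 0.0],
    [0.9, 1.0, 0.4, 0.4, 0.0, 0.0],
    [0.4, 0.4, 1.0, 0.9, 0.0, 0.0],
    [0.4, 0.4, 0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.0, 0.0, 0.9, 1.0],
]
tree = split_tree(toy)
print(tree)  # [[(0, 2), (2, 4)], (4, 6)]
```

The nesting of the returned lists mirrors the hierarchy: the first level separates the two groups, and the second level separates the subgroups inside the first group.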
[0039]
[Table 10]
Figure 0003578618
[0040]
For the sub-matrix covering paragraphs 39 to 55, the evaluation value is 0.681, and the division into paragraphs 39 to 40, paragraph 41, paragraph 42, paragraphs 43 to 48, paragraph 49, paragraphs 50 to 52, and paragraphs 53 to 55 is obtained (Table 11).
[0041]
[Table 11]
Figure 0003578618
[0042]
As Tables 9, 10, and 11 show, the separation of the content becomes looser as the evaluation value of the division increases (0.0266, 0.237, and 0.681, respectively), so the value can be used as an indicator of the division. That is, the smaller the value, the stronger the cohesion of the content, and the larger the value, the weaker the cohesion.
[0043]
A document divided by the document dividing apparatus of the present invention can be used by: document display means that displays the document with its divisions distinguished; document processing means that processes the document one divided partial document at a time; document abstracting means that displays an abstract using key sentence extraction from the document; document management means that manages documents by divided partial document; and document retrieval means that manages documents as search targets in divided units.
[0044]
[Effects of the invention]
According to the present invention, the following effects can be obtained.
(1) Because the division method also takes wide-range relations within the document into account, it can divide documents more appropriately than conventional methods based on relations between adjacent elements.
(2) Documents can be cut out freely according to the purpose.
(3) The relevance within a document can be evaluated simply and easily.
(4) Because division can be applied recursively to give a hierarchical division, the internal structure of a document can be analyzed with higher accuracy.
(5) The strength or weakness of the cohesion of document content can be easily grasped.
[Brief description of the drawings]
FIG. 1 is a block diagram schematically showing an apparatus for implementing the present invention.
[Explanation of symbols]
1: language element extraction means, 2: inter-language-element relevance evaluation means, 3: inter-language-element relevance matrix acquisition means, 3a: inter-language-element relevance matrix, 4: matrix dividing means, D: electronic document, DD: divided document, LE: language element group.

Claims (3)

1. 電子化された文書から言語要素を切り出して記憶手段に記憶する言語要素切り出し手段と、前記記憶手段に記憶された全ての言語要素同士の関連度を、言語要素内の共通の文字、単語、類義単語のいずれか毎の総和の割合により求めて言語要素間関連度行列記憶手段へ記憶する言語要素間関連度行列取得手段と、前記言語要素間関連度行列記憶手段へ記憶された言語要素間関連度行列を、各部分行列の内の要素の関連度の密度と各部分行列の外の要素の関連度の密度との比を評価値とし、該評価値を用いて最適な関連性の高い部分行列の並びに分割して部分行列記憶手段へ記憶する行列分割手段とを有し、前記行列分割手段により分割された部分行列に対して、再帰的に前記行列分割手段を用いることにより、階層的に文書を分割することを特徴とする文書分割装置。
A document dividing apparatus comprising: language element extraction means for cutting out language elements from an electronic document and storing them in storage means; inter-language-element relevance matrix acquisition means for obtaining the relevance between all pairs of language elements stored in the storage means, from the proportion of the total of characters, words, or synonymous words that the language elements have in common, and storing it in inter-language-element relevance matrix storage means; and matrix dividing means for dividing the inter-language-element relevance matrix stored in the inter-language-element relevance matrix storage means into an optimal sequence of highly related sub-matrices, using as an evaluation value the ratio between the density of relevance of the elements inside the sub-matrices and the density of relevance of the elements outside them, and storing the result in sub-matrix storage means; wherein the matrix dividing means is applied recursively to the sub-matrices it has produced, so that the document is divided hierarchically.

2. 請求項1に記載された文書分割装置において、前記言語要素切り出し手段は、段落、文、行のいずれかを言語要素として切り出すことを特徴とする文書分割装置。
The document dividing apparatus according to claim 1, wherein the language element extraction means cuts out a paragraph, a sentence, or a line as a language element.

3. 請求項1または2に記載された文書分割装置において、前記行列分割手段用いる評価値は、分割された部分行列の内容のまとまりの程度を表すことを特徴とする文書分割装置。
The document dividing apparatus according to claim 1 or 2, wherein the evaluation value used by the matrix dividing means represents the degree of cohesion of the content of the divided sub-matrices.
JP04472198A 1998-02-26 1998-02-26 Document splitting device Expired - Fee Related JP3578618B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP04472198A JP3578618B2 (en) 1998-02-26 1998-02-26 Document splitting device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP04472198A JP3578618B2 (en) 1998-02-26 1998-02-26 Document splitting device

Publications (2)

Publication Number Publication Date
JPH11242684A JPH11242684A (en) 1999-09-07
JP3578618B2 true JP3578618B2 (en) 2004-10-20

Family

ID=12699304

Family Applications (1)

Application Number Title Priority Date Filing Date
JP04472198A Expired - Fee Related JP3578618B2 (en) 1998-02-26 1998-02-26 Document splitting device

Country Status (1)

Country Link
JP (1) JP3578618B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9275280B2 (en) 2011-12-09 2016-03-01 Fuji Xerox Co., Ltd. Information processing system and method for document management

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4299963B2 (en) 2000-10-02 2009-07-22 ヒューレット・パッカード・カンパニー Apparatus and method for dividing a document based on a semantic group
CN113673255B (en) * 2021-08-25 2023-06-30 北京市律典通科技有限公司 Text function area splitting method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
JPH11242684A (en) 1999-09-07

Similar Documents

Publication Publication Date Title
Cucerzan Large-scale named entity disambiguation based on Wikipedia data
US6366908B1 (en) Keyfact-based text retrieval system, keyfact-based text index method, and retrieval method
EP0530993B1 (en) An iterative technique for phrase query formation and an information retrieval system employing same
US6876998B2 (en) Method for cross-linguistic document retrieval
US20050203900A1 (en) Associative retrieval system and associative retrieval method
Abbache et al. Arabic query expansion using wordnet and association rules
US20070073678A1 (en) Semantic document profiling
Attardi et al. Categorisation by Context.
WO1997004405A9 (en) Method and apparatus for automated search and retrieval processing
KR20010015368A (en) A method of retrieving data and a data retrieving apparatus
US6278990B1 (en) Sort system for text retrieval
GB2375859A (en) Search engine systems
JP3594701B2 (en) Key sentence extraction device
Pedersen et al. Snippet search: A single phrase approach to text access
JP2009288870A (en) Document importance calculation system, and document importance calculation method and program
Subha et al. Quality factor assessment and text summarization of unambiguous natural language requirements
JP3578618B2 (en) Document splitting device
JPH0944523A (en) Relative word display device
Ringlstetter et al. Adaptive text correction with Web-crawled domain-dependent dictionaries
Whaley An application of word sense disambiguation to information retrieval
Besançon et al. Concept-based searching and merging for multilingual information retrieval: First experiments at clef 2003
Hernandez et al. What is this Text about?
JP2003085181A (en) Encyclopedia system
JPH11259524A (en) Information retrieval system, information processing method in information retrieval system and record medium
JP2007026116A (en) Concept search system and concept search method

Legal Events

Date Code Title Description
A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20040427

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20040608

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20040713

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20040713

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20070723

Year of fee payment: 3

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20080723

Year of fee payment: 4

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090723

Year of fee payment: 5

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090723

Year of fee payment: 5

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20100723

Year of fee payment: 6

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20110723

Year of fee payment: 7

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20120723

Year of fee payment: 8

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20120723

Year of fee payment: 8

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20130723

Year of fee payment: 9

LAPS Cancellation because of no payment of annual fees