JPH03252859A

JPH03252859A - Automatic table of contents generation device

Info

Publication number: JPH03252859A
Application number: JP2051349A
Authority: JP
Inventors: Yoshinori Kishida; 岸田　芳典; Haruo Kimoto; 木本　晴夫; Toshinori Iwadera; 巖寺　俊哲
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1990-03-02
Filing date: 1990-03-02
Publication date: 1991-11-12

Abstract

PURPOSE: To automatically and quickly generate a table of contents for the text of an electronic text file in which no specific marker symbol has been inserted, by analyzing the features of character strings that are table of contents item candidates. CONSTITUTION: Each character string in a table of contents item candidate table is morphologically analyzed by a morphological analysis unit 3. A sentence-final word analysis unit 4 extracts, from the analyzed character strings, those whose final word is a noun or a suffix and stores them in a table of contents item table. A table of contents classification unit 5 then classifies each character string in this table as a large, medium, or small item according to which predetermined category its leading character matches, and a table of contents table is generated. In this way, a table of contents can be generated automatically for an electronic text file whose text contains no specific symbol marking the table of contents items.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to an automatic table of contents generation device that takes text stored as an electronic text file as input and automatically generates a table of contents for that text.

[Prior Art]

An example of the operation of a conventional device of this type is shown in FIG. 4.

In the prior art, a specific symbol "＊" is manually appended in advance to the end of each character string in the text of the electronic text file 11 that is to be recognized as a table of contents item. The table of contents item extraction unit 12 of the computer searches for this symbol to extract the table of contents items, stores the extracted items in the table of contents table 13, and displays them as a table of contents on the display 14.
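The marker-based approach described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the conventional device itself: the function name `extract_marked_items` and the sample text are hypothetical, and the sketch assumes one candidate per line with the marker as the last character.

```python
from typing import List

MARKER = "＊"  # full-width asterisk, the marker symbol named in the description


def extract_marked_items(text: str) -> List[str]:
    """Return every line that ends with the marker, with the marker removed."""
    items = []
    for line in text.splitlines():
        stripped = line.rstrip()
        if stripped.endswith(MARKER):
            items.append(stripped[: -len(MARKER)])
    return items


# Hypothetical sample: two headings were marked by hand, the body line was not.
sample = "１．あいさつ＊\n本日はお集まりいただきありがとうございます。\n２．料金について＊"
marked = extract_marked_items(sample)
print(marked)
```

The sketch makes the drawback visible: nothing is extracted unless someone has already marked every heading by hand.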

[Problem to Be Solved by the Invention]

With this conventional table of contents generation device, the author must keep in mind which items will become table of contents entries while writing, and the table of contents items in the text must be marked manually in advance, which is extra work.

An object of the present invention is to provide an automatic table of contents generation device that can automatically generate a table of contents for the text of an electronic text file in which no specific symbol marking table of contents items has been inserted.

[Means for Solving the Problem]

According to the present invention, an electronic text file is accepted by an original text input unit. From the accepted file, a table of contents item candidate extraction unit extracts each character string that runs from the beginning of the text to the next line break, from one line break to the next, or from a line break to the end of the text, provided that its length is at most a predetermined value (for example, 35 characters) and its first character is an Arabic, Roman, or Chinese numeral, such a numeral enclosed in round brackets, corner brackets, or the like, or a symbol. The extracted character strings are morphologically analyzed by a morphological analysis unit. A sentence-final word analysis unit then extracts those character strings whose final word is a noun or a suffix. A table of contents classification unit classifies each extracted character string as a large, medium, or small item according to whether its leading character is an Arabic, Roman, or Chinese numeral, such a numeral enclosed in round brackets, corner brackets, or the like, or a symbol. The classified character strings are displayed on a display as a table of contents by a table of contents generation result display unit.

[Embodiment]

FIG. 1 shows an embodiment of the present invention. Reference numeral 1 denotes an original text input unit that inputs an electronic text file; 2, a table of contents item candidate extraction unit that extracts character strings that are table of contents item candidates from the input electronic text file; 3, a morphological analysis unit that performs morphological analysis on the character strings extracted by the table of contents item candidate extraction unit 2; 4, a sentence-final word analysis unit that examines the part of speech of the final word of each character string analyzed by the morphological analysis unit 3 and extracts the character strings judged to be table of contents items; 5, a table of contents classification unit that classifies the character strings extracted by the sentence-final word analysis unit 4 into large, medium, and small items and generates a table of contents; and 6, a table of contents generation result display unit that outputs the generated table of contents to a display.

The operation of this embodiment will be explained with reference to FIG. 2.

First, the electronic text file 7 is input to the original text input unit 1. Unlike the file 11 in FIG. 4, this electronic text file 7 has no specific symbol "＊" appended to the ends of the character strings to be recognized as table of contents items. From the electronic text file 7, the table of contents item candidate extraction unit 2 takes each character string running from the beginning of the text to the next line break, from one line break to the next, or from a line break to the end of the text. It extracts those that are at most 35 characters long, contain no full stop (。), and begin with a numeral or a symbol, and stores them in the table of contents item candidate table 8, as shown in FIG. 2.

Here, a leading numeral means an Arabic, Roman, or Chinese numeral, or such a numeral enclosed in round brackets, corner brackets, or the like. As for symbols, FIG. 2 shows "○" and "・" as examples.
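The candidate extraction step can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the regular expressions, the function name `extract_candidates`, and the sample text are all hypothetical, and the sets of numerals, brackets, and symbols are deliberately small. Note that a purely character-based test also accepts ordinary sentences that happen to start with "I" or "一"; in the described device the later sentence-final word check narrows the candidates further.

```python
import re
from typing import List

MAX_LEN = 35  # example threshold given in the description

# Assumed character sets: ASCII/full-width digits, Roman numerals, kanji numerals.
NUMERAL = r"(?:[0-9０-９]+|[IVXⅠ-Ⅻ]+|[一二三四五六七八九十]+)"
LEADING = re.compile(
    rf"^(?:{NUMERAL}"               # bare numeral
    rf"|[(（「]{NUMERAL}[)）」]"     # numeral in round or corner brackets
    r"|[○・])"                      # the two symbols shown in FIG. 2
)


def extract_candidates(text: str) -> List[str]:
    """Keep short lines with no full stop that start with a numeral or symbol."""
    candidates = []
    for line in text.splitlines():
        s = line.strip()
        if not s or len(s) > MAX_LEN:
            continue  # empty or too long
        if "。" in s:
            continue  # contains a full stop, so it is treated as body text
        if LEADING.match(s):
            candidates.append(s)
    return candidates


sample = "\n".join([
    "１．あいさつ",
    "本日はお集まりいただきありがとうございます。",
    "（１）サービス概要とメリット",
    "○サービスの概要",
    "３つの特長があります。",
])
candidates = extract_candidates(sample)
print(candidates)
```

The last sample line starts with a numeral but contains a full stop, so it is rejected at this stage.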

Next, each character string in the table of contents item candidate table 8 is morphologically analyzed by the morphological analysis unit 3. Among the analyzed character strings, those whose final word is a noun or a suffix are extracted by the sentence-final word analysis unit 4 and stored in the table of contents item table 9. The table of contents classification unit 5 then classifies each character string in the table of contents item table 9 as a large, medium, or small item according to which of the predetermined large, medium, and small item patterns its leading character matches, for example as shown in FIG. 3, and the table of contents table 10 is generated.
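The sentence-final word check can be sketched as a filter over the output of a morphological analyzer. The source does not name an analyzer, so the sketch below takes the analyzer as a parameter and uses a hand-written lookup table (`TOY_ANALYSES`, purely hypothetical, with illustrative part-of-speech tags) in its place. All names here are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]                 # (surface form, part of speech)
Analyzer = Callable[[str], List[Token]]

ACCEPTED_FINAL_POS = {"noun", "suffix"}  # the two accepted parts of speech


def filter_by_final_word(candidates: List[str], analyze: Analyzer) -> List[str]:
    """Keep candidates whose last token is a noun or a suffix."""
    items = []
    for s in candidates:
        tokens = analyze(s)
        if tokens and tokens[-1][1] in ACCEPTED_FINAL_POS:
            items.append(s)
    return items


# Hand-written analyses standing in for a real morphological analyzer.
TOY_ANALYSES: Dict[str, List[Token]] = {
    "１．あいさつ": [("１", "numeral"), ("．", "symbol"), ("あいさつ", "noun")],
    "（３）料金等": [("（", "symbol"), ("３", "numeral"), ("）", "symbol"),
                  ("料金", "noun"), ("等", "suffix")],
    "３つの特長があります": [("３", "numeral"), ("つ", "suffix"), ("の", "particle"),
                      ("特長", "noun"), ("が", "particle"), ("あり", "verb"),
                      ("ます", "auxiliary")],
}


def toy_analyze(s: str) -> List[Token]:
    """Look up a pre-written analysis; unknown strings yield no tokens."""
    return TOY_ANALYSES.get(s, [])


items = filter_by_final_word(list(TOY_ANALYSES), toy_analyze)
print(items)
```

Passing the analyzer in as a function keeps the filter independent of any particular morphological analysis library.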

In the example of FIG. 3, numerals without brackets are large items, numerals in brackets are medium items, and symbols are small items. The example of FIG. 2 is therefore classified as follows. Large items: "１．あいさつ" (1. Greetings) and "２．オフトーク通信サービスの説明" (2. Explanation of the Off-Talk communication service). Medium items: "（１）サービス概要とメリット" ((1) Service overview and benefits) and "（２）料金について" ((2) About fees). Small items: "○サービスの概要" (○ Service overview) and "○メリット" (○ Benefits).
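The classification rule of this example (bare numeral, bracketed numeral, symbol) can be sketched as an ordered list of patterns. The item strings are the six from the FIG. 2 example in the original Japanese text; the regular expressions and the names `classify` and `build_toc` are assumptions for illustration only.

```python
import re
from typing import List, Optional, Tuple

# Assumed character sets: ASCII/full-width digits, Roman numerals, kanji numerals.
NUMERAL = r"(?:[0-9０-９]+|[IVXⅠ-Ⅻ]+|[一二三四五六七八九十]+)"

# One pattern per level, following the FIG. 3 example.
RULES = [
    ("large", re.compile(rf"^{NUMERAL}")),                  # numeral without brackets
    ("medium", re.compile(rf"^[(（「]{NUMERAL}[)）」]")),    # numeral in brackets
    ("small", re.compile(r"^[○・]")),                       # symbol
]


def classify(item: str) -> Optional[str]:
    """Return 'large', 'medium', or 'small', or None if no rule matches."""
    for level, pattern in RULES:
        if pattern.match(item):
            return level
    return None


def build_toc(items: List[str]) -> List[Tuple[str, str]]:
    """Pair each classifiable item with its level, preserving document order."""
    toc = []
    for item in items:
        level = classify(item)
        if level is not None:
            toc.append((level, item))
    return toc


fig2_items = [
    "１．あいさつ",
    "２．オフトーク通信サービスの説明",
    "（１）サービス概要とメリット",
    "（２）料金について",
    "○サービスの概要",
    "○メリット",
]
toc = build_toc(fig2_items)
for level, item in toc:
    print(level, item)
```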

The table of contents table 10 generated in this way is displayed on the screen of the display by the table of contents generation result display unit 6.
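One way to present such a table of contents is to indent each entry by its level. The description does not say how the display lays out the three levels, so the indentation scheme and the name `render_toc` below are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed layout: one extra indent step per level.
INDENT = {"large": 0, "medium": 1, "small": 2}


def render_toc(toc: List[Tuple[str, str]], indent_unit: str = "  ") -> str:
    """Render (level, text) pairs as indented lines, one entry per line."""
    return "\n".join(indent_unit * INDENT[level] + text for level, text in toc)


example_toc = [
    ("large", "１．あいさつ"),
    ("medium", "（１）サービス概要とメリット"),
    ("small", "○サービスの概要"),
]
rendered = render_toc(example_toc)
print(rendered)
```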

[Effects of the Invention]

As explained above, the automatic table of contents generation device of the present invention differs from the prior art, in which a specific symbol is manually appended in advance to each character string to be recognized as a table of contents item and the computer searches for and extracts it. Instead, it extracts table of contents items by analyzing the features of candidate character strings, namely that the string is short, begins with a numeral or a symbol, and ends with a noun or a suffix. A table of contents can therefore be generated quickly and automatically.

This eliminates the manual work of marking in advance the character strings that will become table of contents items, as was required in the prior art, and allows the author to write without having to think about the table of contents.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention; FIG. 2 is a diagram showing an example of the operation of the device of FIG. 1; FIG. 3 is a diagram explaining how table of contents items are classified as large, medium, or small items according to the type of their leading character; and FIG. 4 is a diagram explaining the operation of a conventional table of contents generation device.

1: original text input unit; 2: table of contents item candidate extraction unit; 3: morphological analysis unit; 4: sentence-final word analysis unit; 5: table of contents classification unit; 6: table of contents generation result display unit.

Claims

[Claims]

(1) An automatic table of contents generation device that automatically extracts table of contents items from text in an electronic text file, classifies them into large, medium, and small items, and automatically creates a table of contents for the text, the device comprising an original text input unit, a table of contents item candidate extraction unit, a morphological analysis unit, a sentence-final word analysis unit, a table of contents classification unit, and a table of contents generation result display unit, wherein: the original text input unit accepts input of the electronic text file; the table of contents item candidate extraction unit extracts, from the electronic text file supplied by the original text input unit, each character string running from the beginning of the text to the next line break, from one line break to the next, or from a line break to the end of the text whose length is within a predetermined value and whose first character is an Arabic, Roman, or Chinese numeral, such a numeral enclosed in round brackets, corner brackets, or the like, or a symbol; the morphological analysis unit performs morphological analysis on the character strings extracted by the table of contents item candidate extraction unit; the sentence-final word analysis unit extracts those character strings analyzed by the morphological analysis unit whose final word is a noun or a suffix; the table of contents classification unit classifies the character strings extracted by the sentence-final word analysis unit into large, medium, and small items according to whether the leading character is an Arabic, Roman, or Chinese numeral, such a numeral enclosed in round brackets, corner brackets, or the like, or a symbol; and the table of contents generation result display unit displays the character strings classified by the table of contents classification unit on a display as a table of contents.