JPH03252859A - Index automatic generation device - Google Patents

Index automatic generation device

Info

Publication number
JPH03252859A
Authority
JP
Japan
Prior art keywords
contents
sentence
section
character string
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2051349A
Other languages
Japanese (ja)
Inventor
Yoshinori Kishida
岸田 芳典
Haruo Kimoto
木本 晴夫
Toshinori Iwadera
巖寺 俊哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2051349A
Publication of JPH03252859A

Abstract

PURPOSE: To generate, automatically and at high speed, a table of contents for the text of an electronic text file in which no special marker symbols have been inserted, by analyzing the features of the character strings that are candidate table of contents items.
CONSTITUTION: Each character string in a table of contents item candidate table is morphologically analyzed by a morphological analysis unit 3. A sentence-final word analysis unit 4 extracts, from the analyzed character strings, those whose final word is a noun or a suffix, and stores them in a table of contents item table. A table of contents classification unit 5 then classifies each character string in this table as a large, medium, or small item, according to which of the predetermined large, medium, and small item patterns its leading character matches, and a table of contents table is generated. In this way, a table of contents can be generated automatically for an electronic text file whose text contains no special symbols marking table of contents items.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]
The present invention relates to an automatic table of contents generation device that takes as input text stored as an electronic text file and automatically generates a table of contents for that text.

[Prior Art]
FIG. 4 shows an example of the operation of a conventional device of this type.

In the conventional technique, a specific symbol "*" is appended manually, in advance, to the end of each character string in the text of the electronic text file 11 that is to be recognized as a table of contents item. The table of contents item extraction unit 12 of the computer searches for this symbol to extract the table of contents items, stores the extracted items in the table of contents table 13, and displays them as a table of contents on the display 14.
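As a point of comparison, the conventional marker-based extraction can be sketched as follows. This is a minimal Python illustration, not code from the patent: the function name `extract_marked_items` and the sample text are hypothetical, and the sketch assumes the "*" marker closes a line.

```python
def extract_marked_items(text: str, marker: str = "*") -> list[str]:
    """Return the lines that a person has tagged by appending the marker."""
    items = []
    for line in text.splitlines():
        stripped = line.rstrip()
        if stripped.endswith(marker):
            # Drop the marker itself before storing the item.
            items.append(stripped[: -len(marker)].rstrip())
    return items


# Hypothetical sample: two hand-marked headings and one body sentence.
sample = "1. Greetings*\nThank you for coming today.\n(1) Service overview*\n"
print(extract_marked_items(sample))
```

The sketch makes the drawback visible: nothing is extracted unless someone has already inserted the marker by hand.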

[Problem to Be Solved by the Invention]
With this conventional table of contents generation device, the author must keep in mind, while writing, which items will become table of contents entries, and the table of contents items in the text must be marked manually in advance, which is extra work.

An object of the present invention is to provide an automatic table of contents generation device that can automatically generate a table of contents for the text of an electronic text file into which no specific symbols marking table of contents items have been inserted.

[Means for Solving the Problem]
According to the present invention, an electronic text file is accepted by the original text input unit. From the accepted file, the table of contents item candidate extraction unit considers each character string running from the beginning of the text to the next line break, from one line break to the next, or from a line break to the end of the text, and extracts those whose number of characters is at most a predetermined value (for example, 35) and whose first character is an Arabic, Roman, or Chinese numeral, such a numeral enclosed in round brackets, corner brackets, or the like, or a symbol. The extracted character strings are morphologically analyzed by the morphological analysis unit. From the result, the sentence-final word analysis unit extracts the character strings whose final word is a noun or a suffix. The table of contents classification unit classifies these character strings into large, medium, and small items according to whether the leading character is an Arabic, Roman, or Chinese numeral, such a numeral enclosed in round brackets, corner brackets, or the like, or a symbol. The classified character strings are displayed on the display as a table of contents by the table of contents generation result display unit.

[Embodiment]
FIG. 1 shows an embodiment of the present invention. Reference numeral 1 denotes an original text input unit that inputs an electronic text file; 2, a table of contents item candidate extraction unit that extracts, from the input electronic text file, the character strings that are table of contents item candidates; 3, a morphological analysis unit that performs morphological analysis on the character strings extracted by the candidate extraction unit 2; 4, a sentence-final word analysis unit that analyzes the part of speech of the final word of each character string analyzed by the morphological analysis unit 3 and extracts the character strings judged to be table of contents items; 5, a table of contents classification unit that classifies the character strings extracted by the sentence-final word analysis unit 4 into large, medium, and small items and generates the table of contents; and 6, a table of contents generation result display unit that outputs the generated table of contents to a display.
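The data flow between units 1 to 6 can be sketched as a chain of functions. The following Python outline is only an illustration under stated assumptions: the name `run_pipeline`, its parameter names, and the representation of a morpheme as a `(surface, part_of_speech)` pair are all hypothetical and do not come from the patent.

```python
from typing import Callable

Morphemes = list[tuple[str, str]]   # assumed form: (surface, part of speech)
TocEntry = tuple[str, str]          # assumed form: (level, item text)


def run_pipeline(
    text: str,                                   # text accepted by unit 1
    extract: Callable[[str], list[str]],         # unit 2: candidate extraction
    analyze: Callable[[str], Morphemes],         # unit 3: morphological analysis
    accept: Callable[[Morphemes], bool],         # unit 4: sentence-final word check
    classify: Callable[[str], str],              # unit 5: large / medium / small
    display: Callable[[list[TocEntry]], None],   # unit 6: output to the display
) -> list[TocEntry]:
    candidates = extract(text)                              # candidate table 8
    items = [c for c in candidates if accept(analyze(c))]   # item table 9
    toc = [(classify(item), item) for item in items]        # contents table 10
    display(toc)
    return toc
```

Passing each unit in as a callable keeps the sketch independent of any particular morphological analyzer or display.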

The operation of this embodiment will be described with reference to FIG. 2.

First, the electronic text file 7 is input to the original text input unit 1. Unlike the file 11 in FIG. 4, this electronic text file 7 has no specific symbol "*" appended to the ends of the character strings that are to be recognized as table of contents items. From this electronic text file 7, the table of contents item candidate extraction unit 2 examines each character string running from the beginning of the text to the next line break, from one line break to the next, or from a line break to the end of the text. It extracts those that are 35 characters or fewer, contain no full stop ("。"), and begin with a numeral or a symbol, and stores them in the table of contents item candidate table 8, as shown in FIG. 2.
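The three extraction conditions (length, no full stop, leading numeral or symbol) can be sketched in Python as below. This is an illustration, not the patent's implementation: the function name `extract_candidates` and the exact contents of `HEAD_CHARS` are assumptions, chosen to cover the kinds of leading characters the description names.

```python
MAX_LEN = 35      # the "predetermined value" used in the embodiment
FULL_STOP = "。"  # Japanese full stop

# Assumed set of characters that may open a heading: Arabic digits
# (half- and full-width), Roman and Chinese numerals, opening brackets,
# and the two symbols that FIG. 2 uses as examples.
HEAD_CHARS = set(
    "0123456789０１２３４５６７８９"
    "ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩ"
    "一二三四五六七八九十"
    "(（「"
    "○・"
)


def extract_candidates(text: str) -> list[str]:
    """Return the lines that satisfy all three candidate conditions."""
    candidates = []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue  # empty lines cannot be headings
        if len(line) > MAX_LEN:
            continue  # too long to be a heading
        if FULL_STOP in line:
            continue  # contains a full stop, so it reads as a sentence
        if line[0] in HEAD_CHARS:
            candidates.append(line)
    return candidates


sample = "1.あいさつ\n本日はお集まりいただきありがとうございます。\n○サービスの概要\n"
print(extract_candidates(sample))
```

Lines that pass this filter are only candidates; the part-of-speech check in the next stage decides which of them are kept.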

Here, a leading numeral means an Arabic, Roman, or Chinese numeral, or such a numeral enclosed in round brackets, corner brackets, or similar brackets. As for symbols, FIG. 2 shows "○" and "・" as examples.
Roman numerals, Chinese numerals, or these numbers in parentheses such as circles or keys. Further, as symbols, FIG. 2 shows examples of "O" and ".".

Next, each character string in the table of contents item candidate table 8 is morphologically analyzed by the morphological analysis unit 3. Of the analyzed character strings, those whose final word is a noun or a suffix are extracted by the sentence-final word analysis unit 4 and stored in the table of contents item table 9. The table of contents classification unit 5 then classifies each character string in the table of contents item table 9 as a large, medium, or small item, according to which of the predetermined large, medium, and small item patterns (shown, for example, in FIG. 3) its leading character matches, and the table of contents table 10 is generated.
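The sentence-final word check can be sketched as a filter over analyzed morphemes. In the Python sketch below, the `Morpheme` type, the part-of-speech labels, and the hand-tagged sample tokens are all assumptions for illustration; a real system would obtain the morphemes from a morphological analyzer (for example MeCab) and map its tag set onto these labels.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    surface: str
    pos: str  # assumed labels: "noun", "suffix", "particle", "verb", ...


ACCEPTED_FINAL_POS = {"noun", "suffix"}


def is_toc_item(morphemes: list[Morpheme]) -> bool:
    """True if the last morpheme of the string is a noun or a suffix."""
    return bool(morphemes) and morphemes[-1].pos in ACCEPTED_FINAL_POS


# Hand-tagged sample: "1.あいさつ" ends in a noun.
greeting = [
    Morpheme("1", "noun"),
    Morpheme(".", "symbol"),
    Morpheme("あいさつ", "noun"),
]
# Hand-tagged hypothetical sample: "1つ選んでください" ends in a verb form.
request = [
    Morpheme("1", "noun"),
    Morpheme("つ", "suffix"),
    Morpheme("選ん", "verb"),
    Morpheme("で", "particle"),
    Morpheme("ください", "verb"),
]

print(is_toc_item(greeting), is_toc_item(request))
```

The second sample shows why this stage is needed: it is short and begins with a numeral, so it passes candidate extraction, yet it is an ordinary sentence rather than a heading.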

In the example of FIG. 3, numerals without brackets are large items, bracketed numerals are medium items, and symbols are small items. The example of FIG. 2 is therefore classified as follows:
Large items: "1. Greetings", "2. Explanation of the Off-Talk Communication Service"
Medium items: "(1) Service overview and benefits", "(2) Fees"
Small items: "○ Service overview", "○ Benefits"
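The FIG. 3 rule can be sketched as a lookup on the first character. The Python below is a minimal illustration: the names `classify` and `build_toc_table` are hypothetical, the input is assumed to be non-empty strings that already passed the earlier stages, and the order of the sample items is assumed because FIG. 2 is not reproduced in this text.

```python
OPEN_BRACKETS = set("(（「[［")
SYMBOLS = set("○・")


def classify(item: str) -> str:
    """Map an item to a level by its first character, following FIG. 3."""
    first = item[0]
    if first in OPEN_BRACKETS:
        return "medium"  # bracketed numeral
    if first in SYMBOLS:
        return "small"   # symbol
    return "large"       # remaining candidates begin with a bare numeral


def build_toc_table(items: list[str]) -> list[tuple[str, str]]:
    """Pair each item with its level, keeping document order."""
    return [(classify(item), item) for item in items]


# Items from the FIG. 2 example, in the original Japanese (order assumed).
items = [
    "1.あいさつ",                      # 1. Greetings
    "2.オフトーク通信サービスの説明",  # 2. Explanation of the service
    "(1)サービス概要とメリット",       # (1) Service overview and benefits
    "○サービスの概要",                # ○ Service overview
    "○メリット",                      # ○ Benefits
    "(2)料金について",                 # (2) Fees
]
for level, item in build_toc_table(items):
    print(level, item)
```

Because the mapping is "predetermined", a different document style could swap the sets or the returned levels without touching the rest of the pipeline.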

The table of contents table 10 generated in this way is displayed on the screen of the display by the table of contents generation result display unit 6.
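The patent does not specify how the table of contents is laid out on the screen. As one possible rendering, the Python sketch below indents each entry by its level; the name `render_toc`, the `(level, item)` pair format, and the two-space indent are assumptions.

```python
INDENT = {"large": 0, "medium": 1, "small": 2}


def render_toc(entries: list[tuple[str, str]], indent_unit: str = "  ") -> str:
    """Render (level, item) pairs as one indented line per entry."""
    return "\n".join(indent_unit * INDENT[level] + item for level, item in entries)


entries = [
    ("large", "1.あいさつ"),
    ("medium", "(1)サービス概要とメリット"),
    ("small", "○メリット"),
]
print(render_toc(entries))
```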

[Effects of the Invention]
As described above, the automatic table of contents generation device of the present invention differs from the conventional approach, in which specific symbols are manually added in advance to the character strings to be recognized as table of contents items and the computer searches for and extracts them. Instead, it analyzes the features of the character strings that are candidate table of contents items; that is, it extracts strings that are short, begin with a numeral or a symbol, and end with a noun or a suffix. A table of contents can therefore be generated automatically and at high speed.

This eliminates the manual work, required by the conventional technique, of marking in advance the character strings that are to become table of contents items, and also allows authors to write without having to think about the table of contents.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention; FIG. 2 is a diagram showing an example of the operation of the device of FIG. 1; FIG. 3 is a diagram explaining how, in classifying table of contents items, the kind of leading character determines whether an item is classified as large, medium, or small; and FIG. 4 is a diagram explaining the operation of a conventional table of contents generation device.
1: original text input unit, 2: table of contents item candidate extraction unit, 3: morphological analysis unit, 4: sentence-final word analysis unit, 5: table of contents classification unit, 6: table of contents generation result display unit.

Claims (1)

[Claims]
(1) An automatic table of contents generation device that automatically extracts table of contents items from text stored as an electronic text file, classifies them into large, medium, and small table of contents items, and automatically creates a table of contents for the text,
the device comprising an original text input unit, a table of contents item candidate extraction unit, a morphological analysis unit, a sentence-final word analysis unit, a table of contents classification unit, and a table of contents generation result display unit, wherein:
the original text input unit accepts input of the electronic text file;
the table of contents item candidate extraction unit extracts, from the character strings of the electronic text file received from the original text input unit that run from the beginning of the text to the next line break, from one line break to the next, or from a line break to the end of the text, those character strings whose number of characters is within a predetermined value and whose first character is an Arabic, Roman, or Chinese numeral, such a numeral enclosed in round brackets, corner brackets, or other brackets, or a symbol;
the morphological analysis unit performs morphological analysis on the character strings extracted by the table of contents item candidate extraction unit;
the sentence-final word analysis unit extracts, from the character strings analyzed by the morphological analysis unit, those whose final word is a noun or a suffix;
the table of contents classification unit classifies the character strings extracted by the sentence-final word analysis unit into large, medium, and small items according to whether the leading character is an Arabic, Roman, or Chinese numeral, such a numeral enclosed in round brackets or corner brackets, or a symbol; and
the table of contents generation result display unit displays the character strings classified by the table of contents classification unit on a display as a table of contents.
Classified into large items, medium items, and small items based on whether they are Chinese numerals or symbols, and in the table of contents generation result display section above,
An automatic table of contents generation device characterized in that the character strings classified by the table of contents classification section are displayed on a display device as a table of contents.
JP2051349A 1990-03-02 1990-03-02 Index automatic generation device Pending JPH03252859A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2051349A JPH03252859A (en) 1990-03-02 1990-03-02 Index automatic generation device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2051349A JPH03252859A (en) 1990-03-02 1990-03-02 Index automatic generation device

Publications (1)

Publication Number Publication Date
JPH03252859A true JPH03252859A (en) 1991-11-12

Family

ID=12884451

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2051349A Pending JPH03252859A (en) 1990-03-02 1990-03-02 Index automatic generation device

Country Status (1)

Country Link
JP (1) JPH03252859A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010527051A (en) * 2007-03-30 2010-08-05 グーグル・インコーポレーテッド Document processing for mobile devices

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS62249270A (en) * 1986-04-23 1987-10-30 Toshiba Corp Document processor
JPS6359658A (en) * 1986-08-30 1988-03-15 Canon Inc Document processor
JPS644864A (en) * 1987-06-26 1989-01-10 Nec Corp Sentence processor

Similar Documents

Publication Publication Date Title
US5890103A (en) Method and apparatus for improved tokenization of natural language text
EP0394633A2 (en) Method for language-independent text tokenization using a character categorization
JP2002197104A (en) Device and method for data retrieval processing, and recording medium recording data retrieval processing program
US20060143555A1 (en) Apparatus and method for extracting information from a formatted document
JPH03252859A (en) Index automatic generation device
JP3398729B2 (en) Automatic keyword extraction device and automatic keyword extraction method
US6523031B1 (en) Method for obtaining structured information exists in special data format from a natural language text by aggregation
JPH03105465A (en) Compound word extraction device
JPS61248160A (en) Document information registering system
JPH0474259A (en) Document summarizing device
Ikeda et al. Expressive power of tree and string based wrappers
JP2006301296A (en) Document display device and method
JPH0612453A (en) Unknown word extracting and registering device
Pramoda Devi et al. A Comparative Study on Various Approaches and Complexities of Text Summarization
JP2899184B2 (en) Japanese morphological analysis system and heading extraction method
JPH05181853A (en) Document processing system
JPH0944496A (en) Method and device for analyzing natural language
JPH02255970A (en) Sentence presentation device
JPH07182348A (en) Translation system
JPH07141347A (en) Method for segmenting japanese character string
JPS63200290A (en) Character recognizing system
JPH04188364A (en) Device for extracting intrinsic wording of japanese sentence
JPH07219953A (en) Document summarizing device
JPH07141369A (en) Japanese-language character string segmenting method
JPH0793313A (en) Document processor