JPH04243477A - Index word extraction method for natural language processing system

Info

Publication number: JPH04243477A
Application number: JP3017044A
Authority: JP
Inventors: Masa Saito; 斎藤　雅; Hiroshi Teranishi; 浩寺西; Takahiro Nakajima; 孝浩中島
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 1991-01-17
Filing date: 1991-01-17
Publication date: 1992-08-31

Abstract

PURPOSE: To automatically extract index words from a text by applying natural language word segmentation and kana annotation to the text data and referring to the part-of-speech information of each word. CONSTITUTION: An input processing section 101 copies the natural language processing system input data magnetic tape onto a disk file as input data 102, checks the kanji codes and the like, and then converts the data into records for Japanese language processing. An output processing section 120 copies the processing result file on the disk to the natural language processing output magnetic tape as processing result data 121. A driver 103 classifies and analyzes the input data 102, controls a Japanese language processing system 110, obtains the results of word segmentation, kana annotation, and keyword extraction, and edits and outputs the processing results in the natural language processing system output data format.

Description

[Detailed description of the invention]

【０００１】0001

【産業上の利用分野】この発明は、自然言語処理システ
ムを利用し、索引語の抽出を行なう本文中のデータに対
して分かち書き／カナ振りを行ない、単語毎の品詞情報
より索引語の抽出を行なうようにした自然言語処理シス
テムによる索引語抽出方法に関する。[Industrial Application Field] This invention relates to an index word extraction method using a natural language processing system, in which word segmentation (wakachigaki) and kana annotation are applied to the text data from which index words are to be extracted, and the index words are extracted from the part-of-speech information of each word.

【０００２】0002

【従来の技術】最近、印刷物用に蓄積した文書データを
２次利用してＣＤ−ＲＯＭやデータベースを作成するこ
とが多くなっている。そして、データベース検索用のキ
ーワードや索引語を抽出する作業は、従来より専門家に
よる手作業によっていた。[Prior Art] Recently, document data accumulated for printed matter is increasingly reused to create CD-ROMs and databases. The work of extracting keywords and index words for database searches has traditionally been done manually by experts.

【０００３】0003

【発明が解決しようとする課題】データベース検索用の
キーワードを抽出する作業が、従来は専門家が文書の中
から重要語を選択し、更に読み方を付けるようになって
いる。このため、データベースのキーワードや索引語抽
出作業に多大な労力を要し、作業そのものが非効率的で
あった。[Problems to be Solved by the Invention] Conventionally, keywords for database searches have been extracted by having experts select important words from a document and then assign readings to them. Extracting database keywords and index words therefore required a great deal of labor, and the work itself was inefficient.

【０００４】この発明は上述のような事情より成された
ものであり、この発明の目的は、ＡＩ（人工知能）の一
分野の自然言語処理技術を利用して本文中の索引語を自
動的に抽出するための方法を提供することにある。[0004] This invention was made in view of the circumstances described above, and its object is to provide a method for automatically extracting index words from a text using natural language processing technology, a field of AI (artificial intelligence).

【０００５】[0005]

【課題を解決するための手段】この発明は自然言語処理
システムによる索引語抽出方法に関するもので、この発
明の上記目的は、予め決定している索引語をユーザ辞書
への登録を行なった後に索引語後抽出を行なう本文デー
タに対し、自然言語処理で分かち書き／カナ振りを行な
い、単語毎の品詞情報を参照して前記索引語の抽出を行
なうことによって達成される。[Means for Solving the Problems] The present invention relates to an index word extraction method using a natural language processing system. The above object is achieved by registering predetermined index words in a user dictionary, then applying word segmentation and kana annotation by natural language processing to the text data from which index words are to be extracted, and extracting the index words by referring to the part-of-speech information of each word.

【０００６】[0006]

【作用】この発明では、本文データに対する索引語の抽
出にＡＩの一種である自然言語処理を用いており、ユー
ザ辞書を参照して入力原文データに対して分かち書き（
品詞分解）及びカナ振りを自動的に行ない、単語毎の品
詞情報（品詞情報の中に索引語情報が含まれている）を
参照して索引語の抽出を行なっている。[Operation] In this invention, natural language processing, a type of AI, is used to extract index words from the text data. Word segmentation (part-of-speech decomposition) and kana annotation are performed automatically on the input original text data with reference to the user dictionary, and the index words are extracted by referring to the part-of-speech information of each word (which includes index word information).

【０００７】すなわち、予め決定している書籍の索引語
を自然言語処理システムを利用して、本文中より抽出す
る。索引語のユーザ辞書への登録を行なった後に索引語
抽出を行なう本文データに対し、自然言語処理で分かち
書き／カナ振りを行ない、単語毎の品詞情報（品詞情報
の中に索引語情報が含まれている）を参照して索引語の
抽出を行なうものである。That is, the predetermined index words of a book are extracted from the text using a natural language processing system. After the index words are registered in the user dictionary, word segmentation and kana annotation are applied by natural language processing to the text data from which index words are to be extracted, and the index words are extracted by referring to the part-of-speech information of each word (which includes index word information).

【０００８】[0008]

【実施例】先ず、この発明で用いる自然言語処理システ
ムについて説明する。[Embodiment] First, the natural language processing system used in the present invention is described.

【０００９】図６は自然言語処理システムのハードウエ
ア構成例を示しており、ホストマシン１０にはＣＰＵ１
１　及び実装メモリ１２が内蔵されると共に、バスライ
ン１３を介して磁気ディスク装置１４，カセット磁気テ
ープ装置１５が接続されている。ホストマシン１０には
、更に磁気テープ装置２０，レーザープリンタ２１及び
コンソール端末２３が接続されると共に、ＲＳ−２３２
Ｃ　のインターフェイス１６を介して確認／修正用端末
２２が接続されている。FIG. 6 shows an example of the hardware configuration of the natural language processing system. A host machine 10 contains a CPU 11 and installed memory 12, and a magnetic disk device 14 and a cassette magnetic tape device 15 are connected to it via a bus line 13. A magnetic tape device 20, a laser printer 21, and a console terminal 23 are also connected to the host machine 10, and a confirmation/correction terminal 22 is connected via an RS-232C interface 16.

【００１０】図７は自然言語処理システムのソフトウエ
ア構成を示しており、磁気テープからの入力データは入
力処理１０１　されて取込まれ、ホストマシン１０で処
理された情報は出力処理１２０　されて磁気テープの出
力データとなる。すなわち、入力処理１０１　は自然言
語処理システム入力データ磁気テープをディスクファイ
ル上に入力データ１０２　としてコピーし、漢字コード
等のチェックを行ない、その後に日本語処理用レコード
に変換する。また、出力処理１２０　はディスク上の処
理結果ファイルを処理結果データ１２１　として自然言
語処理出力磁気テープへコピーする。ドライバ１０３　
は入力データ１０２　の分類／解析を行ない、日本語処
理システム１１０　を制御し、分かち書き，カナ振り，
キーワード抽出結果を取得し、自然言語処理システム出
力データ形式で、処理結果を編集／出力する。FIG. 7 shows the software configuration of the natural language processing system. Input data from magnetic tape is taken in by input processing 101, and the information processed by the host machine 10 is written out by output processing 120 as magnetic tape output data. That is, input processing 101 copies the natural language processing system input data magnetic tape onto a disk file as input data 102, checks the kanji codes and the like, and then converts the data into records for Japanese language processing. Output processing 120 copies the processing result file on the disk to the natural language processing output magnetic tape as processing result data 121. A driver 103 classifies and analyzes the input data 102, controls the Japanese language processing system 110, obtains the word segmentation, kana annotation, and keyword extraction results, and edits and outputs the processing results in the natural language processing system output data format.

【００１１】日本語処理システム１１０　は基本辞書ア
クセスルーチン１１２　を介して形態素解析を行ない、
言語処理で認定する全ての単語についてその読みを抽出
し、カナ振り出力文として出力する。名詞列抽出は言語
処理による単語認定結果で、その品詞が次の（ａ），（
ｂ）　に該当するときに名詞として抽出する。（ａ）　一般名詞，サ変型名詞，形動型名詞，転成名詞
，時詞，数詞，固有名詞，代名詞、形式名詞　　　　　
　　　　　　　　　　　　　　　　　　　　　　　　　
　　　　　　　　　　　　　　　　　　　　（ｂ）　接
辞についてはそれぞれ前後の品詞が以下に該当するとき
、該当単語を名詞として抽出する。The Japanese language processing system 110 performs morphological analysis via the basic dictionary access routine 112, extracts the readings of all words recognized by language processing, and outputs them as a kana-annotated output sentence. In noun string extraction, a word recognized by language processing is extracted as a noun when its part of speech falls under (a) or (b) below.
(a) Common nouns, sa-hen (suru-verb) nouns, adjectival (keiyodoshi-type) nouns, derived nouns, temporal nouns, numerals, proper nouns, pronouns, formal nouns
(b) For affixes, the word is extracted as a noun when the adjacent part of speech (following a prefix, or preceding a suffix) falls under the categories listed below.

【００１２】■接頭辞の場合後方品詞：一般名詞，サ変型名詞，形動型名詞，転成名
詞，時詞，数詞，固有名詞，代名詞，形式名詞■接尾辞
の場合前方品詞：一般名詞，サ変型名詞，形動型名詞，転成名
詞，時詞，数詞，固有名詞，代名詞，形式名詞また、日
本語文章と上記より求められたキーワード分析テーブル
を入力すると共に、統計的解析，構文解析，知識処理等
の手法を用いてアクセスファイルルーチン１１１　と協
働して入力日本語文章の解析を行ない、キーワード抽出
，絞り込み，重要度評価を行なう。
■ For a prefix, the following part of speech must be one of: common noun, sa-hen (suru-verb) noun, adjectival (keiyodoshi-type) noun, derived noun, temporal noun, numeral, proper noun, pronoun, formal noun.
■ For a suffix, the preceding part of speech must be one of: common noun, sa-hen (suru-verb) noun, adjectival (keiyodoshi-type) noun, derived noun, temporal noun, numeral, proper noun, pronoun, formal noun.
In addition, the Japanese text and the keyword analysis table obtained as above are input, and the input Japanese text is analyzed in cooperation with the access file routine 111 using techniques such as statistical analysis, syntactic analysis, and knowledge processing, in order to perform keyword extraction, narrowing down, and importance evaluation.
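The noun-extraction rules (a) and (b) can be illustrated with a minimal Python sketch. It assumes the morphological analyzer has already produced (surface, part-of-speech) pairs; the tag names such as `common_noun`, `prefix`, and `suffix`, and the function `extract_nouns`, are hypothetical and do not come from the patent.

```python
# Hypothetical tag names standing in for the noun categories of rule (a).
NOUN_TAGS = {
    "common_noun", "sa_verbal_noun", "adjectival_noun", "derived_noun",
    "temporal_noun", "numeral", "proper_noun", "pronoun", "formal_noun",
}


def extract_nouns(tokens):
    """Return the surfaces kept as nouns under rules (a) and (b).

    tokens: list of (surface, pos) pairs in sentence order.
    """
    kept = []
    for i, (surface, pos) in enumerate(tokens):
        if pos in NOUN_TAGS:
            # Rule (a): the word itself belongs to a noun category.
            kept.append(surface)
        elif pos == "prefix":
            # Rule (b): a prefix is kept when the FOLLOWING word is a noun.
            if i + 1 < len(tokens) and tokens[i + 1][1] in NOUN_TAGS:
                kept.append(surface)
        elif pos == "suffix":
            # Rule (b): a suffix is kept when the PRECEDING word is a noun.
            if i > 0 and tokens[i - 1][1] in NOUN_TAGS:
                kept.append(surface)
    return kept
```

In this sketch a suffix that follows another suffix is not kept, because only the immediately adjacent part of speech is checked; whether the original system chains affixes is not stated in the text.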

【００１３】端末通信処理１２３　は確認／修正用端末
２２との間で通信を行ない、端末出力用のデータ変換を
行なう。そして、端末からの修正データを出力ファイル
の形式に変換して書込む。また、リスト出力処理１２２
　は、端末から出力依頼のあった処理結果データをプリ
ンタ出力用データに編集すると共に、プリンタ出力用デ
ータをレーザープリンタ２１に出力する。Terminal communication processing 123 communicates with the confirmation/correction terminal 22 and converts data for terminal output. It also converts correction data from the terminal into the output file format and writes it. List output processing 122 edits the processing result data requested by the terminal into printer output data and outputs that data to the laser printer 21.

【００１４】ところで、ホストマシン１０が扱い得る自
然言語処理機能は、Ａ．処理種１：分かち書きＢ．処理種２：カナ振りａ（分かち書き単位のカナ振り
）Ｃ．処理種３：カナ振りｂ（漢字単位のカナ振り，総
ルビ振り）Ｄ．処理種４：キーワード抽出及びキーワードへのカナ
振りの４種であり、入力ファイルのレコード単位に上記各機
能を切替えて処理することができる。The host machine 10 can handle four natural language processing functions:
A. Processing type 1: word segmentation
B. Processing type 2: kana annotation a (kana annotation per segmented unit)
C. Processing type 3: kana annotation b (kana annotation per kanji unit; full ruby annotation)
D. Processing type 4: keyword extraction and kana annotation of the keywords
These functions can be switched for each record of the input file.
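Switching the function record by record can be pictured as a dispatch on the processing type carried by each record. The following Python sketch is only an illustration under that assumption; the enum names, the `(processing_type, text)` record shape, and the handler functions supplied by the caller are all hypothetical.

```python
from enum import IntEnum


class ProcessingType(IntEnum):
    """The four processing types named in the text (member names are illustrative)."""
    SEGMENTATION = 1        # A. word segmentation
    KANA_PER_SEGMENT = 2    # B. kana annotation a
    KANA_PER_KANJI = 3      # C. kana annotation b / full ruby
    KEYWORD_EXTRACTION = 4  # D. keyword extraction with readings


def process_records(records, handlers):
    """Apply, to each record, the handler selected by its processing type.

    records:  iterable of (processing_type, text) pairs.
    handlers: dict mapping ProcessingType to a callable taking the text.
    """
    results = []
    for ptype, text in records:
        handler = handlers[ProcessingType(ptype)]
        results.append((int(ptype), handler(text)))
    return results
```

A caller would register one handler per processing type and could then mix, for example, type 1 and type 4 records in the same input file.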

【００１５】次に、各機能（処理種１〜４）について説
明する。Ａ．分かち書き（処理種１）：日本語文章（漢字かな交
じり文）を入力して分かち書きを行ない、名詞，動詞，
形容詞について品詞情報を付加する。出力される情報は
、スラッシュ“／”による分かち書きと品詞情報（名詞
，動詞，形容詞，未知語）である。処理種１の出力形式
は図８のようになる。Ｂ．カナ振りａ（処理種２；分かち書き単位のカナ振り
）：日本語文章（漢字かな交じり分）を入力して分かち
書きを行ない、分かち書きされた単語単位にカナ振りを
行なう。読みはカタカナで振られ、名詞，動詞，形容詞
については品詞情報を付加する。そして、出力される情
報は、スラッシュによる分かち書き，品詞情報（名詞，
動詞，形容詞，未知語），分かち書き単語要素へのカナ
振り結果である。処理種２の出力形式は図９のようにな
る。Ｃ．カナ振りｂ（処理種３）：この処理種３は、分野別
辞書１０６　を使用したカナ振り及び総ルビ振り（漢字
（列）単位のカナ振り）の機能を有している。分野別辞
書１０６　を使用したカナ振りは人名，地名，各種専門
用語等の項目データに対して、品目専用の辞書を利用し
てカナ振りを行なうものである。かな振りの方法は項目
データをＫＥＹ　にして分野別辞書１０６　をサーチし
、マッチングした場合に分野別辞書１０６　に登録され
ているカナを振る。これでカナが得られなかった場合、
日本語処理システムを呼出して基本辞書１１５　によっ
てカナを振る。Next, each function (processing types 1 to 4) is explained.
A. Word segmentation (processing type 1): A Japanese sentence (mixed kanji and kana) is input and segmented, and part-of-speech information is added for nouns, verbs, and adjectives. The output consists of the text segmented by slashes "/" and the part-of-speech information (noun, verb, adjective, unknown word). The output format of processing type 1 is shown in FIG. 8.
B. Kana annotation a (processing type 2; kana annotation per segmented unit): A Japanese sentence (mixed kanji and kana) is input and segmented, and a reading is assigned to each segmented word. Readings are given in katakana, and part-of-speech information is added for nouns, verbs, and adjectives. The output consists of the slash-segmented text, the part-of-speech information (noun, verb, adjective, unknown word), and the readings assigned to the segmented word elements. The output format of processing type 2 is shown in FIG. 9.
C. Kana annotation b (processing type 3): Processing type 3 provides kana annotation using the field-specific dictionary 106 and full ruby annotation (kana annotation per kanji or kanji string). Kana annotation using the field-specific dictionary 106 assigns readings to item data such as personal names, place names, and technical terms, using a dictionary dedicated to that kind of item. The field-specific dictionary 106 is searched with the item data as the key, and when a match is found, the reading registered in the field-specific dictionary 106 is assigned. If no reading is obtained this way, the Japanese language processing system is called and a reading is assigned using the basic dictionary 115.

【００１６】データの入力形式は、単項目データの場合
は“項目データ”であり、複数項目データを１レコード
で処理する場合は、“項目データ１”／“項目データ２
”／………／“項目データＮ”のように各項目データを
スラッシュで区切るようにしている。そして、出力され
る情報は、入力項目データに対する読み（カタカナ）と
カナデータの典拠辞書識別（どの辞書に基づいてカナが
振られたかの識別）である。処理種３の出力形式は図１
０のようになっており、■分野別辞書１０６　で読みが
取得された場合、■基本辞書１１５　で読みが取得され
た場合、■分野別辞書１０６　及び基本辞書１１５　の
両方共に読みが登録されていない場合、に分けて識別コ
ード（例えばＡＡ，ＡＢ，ＡＣ）　を与えている。[0016] For single-item data, the input format is "item data"; when multiple items are processed in one record, the items are separated by slashes, as in "item data 1"/"item data 2"/.../"item data N". The output consists of the reading (katakana) of each input item and an identification of the source dictionary of the kana data (that is, which dictionary the reading was taken from). The output format of processing type 3 is shown in FIG. 10. An identification code (for example AA, AB, AC) is assigned to distinguish three cases:
■ the reading was obtained from the field-specific dictionary 106;
■ the reading was obtained from the basic dictionary 115;
■ the reading is registered in neither the field-specific dictionary 106 nor the basic dictionary 115.
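The two-stage lookup of processing type 3 can be sketched in Python as follows. This is a simplified illustration, not the patented implementation: both dictionaries are modeled as plain `dict` objects, the fallback to the basic dictionary is reduced to a key lookup (the text says the Japanese language processing system is called at that point), and the assignment of the example codes AA, AB, and AC to the three cases in that order is an assumption.

```python
def lookup_reading(item, field_dict, basic_dict):
    """Return (reading, source_code) for one item.

    Assumed code mapping: "AA" = field-specific dictionary,
    "AB" = basic dictionary, "AC" = no reading registered in either.
    """
    if item in field_dict:
        return field_dict[item], "AA"
    if item in basic_dict:
        return basic_dict[item], "AB"
    return None, "AC"


def annotate_record(record, field_dict, basic_dict):
    """Split a slash-separated record into items and look up each reading."""
    return [
        (item, *lookup_reading(item, field_dict, basic_dict))
        for item in record.split("/")
    ]
```

For example, calling `annotate_record("斎藤/東京", field_dict, basic_dict)` yields one `(item, reading, code)` tuple per item, so a reviewer at the confirmation terminal could see which dictionary each reading came from.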

【００１７】分野別辞書１０６　を使用したカナ振りで
処理対象となるデータは、人名，地名，各種専門用語等
の項目データ（主に固有名詞）であり、総ルビ振りで処
理対象となるデータは日本語の漢字かな交じり文である
。総ルビ振り（漢字（列）単位のカナ振り）の機能は、
日本語文章（漢字かな交じり文）を入力して全ての漢字
に対してカナ振りを行なうものである。カナ振り方法は
、入力原文中の漢字（列）（ＪＩＳ　非漢字以外）に対
してカナ（ルビ）を振り、ルビは「群扱いルビ」の形式
で振られる。その出力形式は図１１のようになっている
。Ｄ．キーワード抽出及びキーワードへのカナ振り（処理
種４）：入力した日本語文章から日本語処理システムの
言語処理機能によりフリーキーワードの抽出を行ない、
抽出したキーワードに読みを付加する。The data processed by kana annotation using the field-specific dictionary 106 is item data (mainly proper nouns) such as personal names, place names, and technical terms, whereas the data processed by full ruby annotation is Japanese text in mixed kanji and kana. Full ruby annotation (kana annotation per kanji or kanji string) takes a Japanese sentence (mixed kanji and kana) as input and assigns readings to all kanji. Kana (ruby) are assigned to each kanji or kanji string in the input original text (characters other than JIS non-kanji), and the ruby are assigned in the "group ruby" format. The output format is shown in FIG. 11.
D. Keyword extraction and kana annotation of keywords (processing type 4): Free keywords are extracted from the input Japanese text by the language processing function of the Japanese language processing system, and readings are added to the extracted keywords.

【００１８】出力される情報は、抽出されたキーワード
，キーワードの読み（カタカナ）及びキーワードの解析
結果であり、出力形式は図１２のようになっている。な
お、解析情報は、日本語処理システムによるキーワード
認定の過程で得られた解析情報がセットされるエリアで
ある。The output consists of the extracted keywords, their readings (katakana), and the keyword analysis results; the output format is shown in FIG. 12. The analysis information field is an area in which the analysis information obtained during keyword recognition by the Japanese language processing system is set.

【００１９】確認／修正用端末２２の機能は、処理結果
ファイルの中の入力原文データと処理結果データ１２１
　をホストマシン１０より端末通信処理１２３　を介し
て受け取り、端末装置のディスプレイに表示し、ホスト
マシン１０のレーザープリンタ２１に出力することによ
り処理結果の確認及び修正作業を容易に行なうことを目
的とする。端末２２からのキーボード操作により、確認
／修正を行なう処理結果ファイルのジョブ名指定を行な
い、１レコード毎に入力原文データと処理結果データ１
２１　を端末装置のデイスプレイ上に表示し、確認／修
正作業を行なう。ディスプレイの表示形式は、処理種に
より以下（Ａ）　〜（Ｄ）　のようになっている。（Ａ）　処理種１（分かち書き）の場合は、入力原文と
処理された入力原文の分かち書き結果を画面出力する。（Ｂ）　処理種２（分かち書き単位のカナ振り）の場合
は、入力原文と処理された入力原文の分かち書き単位の
カナ振り結果を画面出力する。（Ｃ）　処理種３（総ルビ振り）の場合は、入力原文中
の全ての漢字に対してのカナ振り結果を表示色を変えて
画面出力する。（Ｄ）　処理種４（キーワード抽出）の場合は、入力原
文と入力原文中から抽出されたキーワード及びそのカナ
振り結果を画面出力する。The purpose of the confirmation/correction terminal 22 is to make it easy to check and correct processing results: it receives the input original text data and the processing result data 121 in the processing result file from the host machine 10 via terminal communication processing 123, displays them on the terminal display, and outputs them to the laser printer 21 of the host machine 10. Using the keyboard of the terminal 22, the operator specifies the job name of the processing result file to be checked or corrected; the input original text data and the processing result data 121 are then displayed on the terminal display one record at a time for checking and correction. The display format depends on the processing type, as in (A) to (D) below.
(A) Processing type 1 (word segmentation): the input original text and its segmentation result are displayed.
(B) Processing type 2 (kana annotation per segmented unit): the input original text and its per-segment kana annotation result are displayed.
(C) Processing type 3 (full ruby annotation): the kana annotation results for all kanji in the input original text are displayed in a different color.
(D) Processing type 4 (keyword extraction): the input original text, the keywords extracted from it, and their kana annotation results are displayed.

【００２０】次に、キーボード操作により処理結果デー
タの修正を行なうが基本的な修正機能を以下に挙げて説
明する。Next, the processing result data is corrected by keyboard operation, and the basic correction functions will be listed and explained below.

【００２１】処理種３及び処理種４の場合のみ修正が可
能である。処理種３（総ルビ振り）の場合はカナ振り結
果の修正が可能であり、処理種４（キーワード抽出）の
場合はカナ振り結果の修正及びキーワードの挿入，削除
，順位の入れ替えが可能である。端末２２で処理結果デ
ータ１２１　の修正があった場合、キーボード操作によ
って修正後データをホストマシン１０に送信する。ホス
トマシン１０では、修正後データを基に処理結果ファイ
ルのレコード更新を行なう。Correction is possible only for processing types 3 and 4. For processing type 3 (full ruby annotation), the kana annotation results can be corrected; for processing type 4 (keyword extraction), the kana annotation results can be corrected and keywords can be inserted, deleted, and reordered. When the processing result data 121 has been corrected on the terminal 22, the corrected data is sent to the host machine 10 by a keyboard operation. The host machine 10 updates the records of the processing result file based on the corrected data.

【００２２】一方、端末２２からのキーボード操作によ
り、ホストマシン１０のレーザープリンタ２１に指定さ
れた処理結果ファイルあるいはレコードのプリンタ出力
を行なう。オペレータによるＰキー（プリントキー）の
押下による処理結果ファイルあるいは処理結果レコード
単位のプリント出力要求があった場合、処理種毎のフォ
ーマットに合せてホストマシン１０から取り出したレコ
ードのプリンタ出力を行なう。On the other hand, a keyboard operation from the terminal 22 causes the laser printer 21 of the host machine 10 to print out the designated processing result file or record. When an operator presses the P key (print key) to request a printout of a processing result file or processing result record, the record taken out from the host machine 10 is output to the printer in accordance with the format of each processing type.

【００２３】以上が自然言語処理システムの概要である
が、この発明は上記自然言語処理システムを用いて本文
中の索引語を自動的に抽出するものである。図１はこの
発明の処理フローを示しており、磁気記憶媒体等に格納
された本文マスターに対して先ず前処理を行なう（ステ
ップＳ１０）。前処理の詳細は図２に示すようになって
おり、最初にデータのコード変換を行ない（ステップＳ
１１）、コード変換されたデータに対して自然言語処理
入力ファイルを作成し（ステップＳ１２）、全データに
対して上記動作を繰り返す。コード変換データはＪＩＳ
　コード及びＣＴＳ（Ｃｏｍｐｕｔｅｒｉｚｅｄ　Ｔｙ
ｐｅ　Ｓｅｔｔｉｎｇ）コードで作成されている場合が
多い。自然言語処理システムのコード体系は一般的にシ
ステム固有コードであるため、データのコード変換を行
なう必要がある。また、自然言語処理入力ファイル作成
は、コード変換したデータ毎に自然言語処理入力ファイ
ルレコードの作成を行なうものである。The above is an overview of the natural language processing system; the present invention uses this system to automatically extract index words from a text. FIG. 1 shows the processing flow of the invention. Preprocessing is first performed on the text master stored on a magnetic storage medium or the like (step S10). The details of the preprocessing are shown in FIG. 2: code conversion of the data is performed first (step S11), a natural language processing input file is created from the code-converted data (step S12), and these operations are repeated for all data. The data subject to code conversion is in many cases written in JIS code or CTS (Computerized Type Setting) code. Because the code system of a natural language processing system is generally system-specific, code conversion of the data is necessary. Natural language processing input file creation produces a natural language processing input file record for each item of code-converted data.

【００２４】予め、決定している索引語のユーザ辞書へ
の登録を行なう。登録を行なう際に、登録する単語が索
引語であるという識別子も同時にユーザ辞書に登録する
。つまり、索引語をユーザ辞書に登録する場合には、通
常の登録以外に索引語であるという識別子の登録も行な
う。上述のように前処理されたデータは次のステップＳ
１で自然言語処理される。図１８に示すような基本サー
ビス辞書（システム辞書＋ユーザ辞書）を参照して、入
力原文データの分かち書き（品詞分解）及びカナ振りを
行なう。分かち書きされたデータの直前には、その単語
の品詞識別ＩＤが付加されており、その中に索引語識別
ＩＤが含まれているために索引語を判別出来る。図１９
はその例を示す。[0024] The predetermined index words are registered in the user dictionary in advance. At registration, an identifier indicating that the word is an index word is registered in the user dictionary together with the word; that is, in addition to the normal registration, the index word identifier is also registered. The data preprocessed as described above undergoes natural language processing in the next step S1. With reference to the basic service dictionary (system dictionary + user dictionary) shown in FIG. 18, word segmentation (part-of-speech decomposition) and kana annotation are applied to the input original text data. A part-of-speech identification ID is attached immediately before each segmented word, and because it includes an index word identification ID, index words can be identified. FIG. 19 shows an example.
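The role of the index word identifier in the user dictionary can be illustrated with a small Python sketch. Several points are assumptions made only for illustration: the `Entry` fields, the rule that user entries override system entries with the same surface form, and the use of greedy longest-match segmentation as a stand-in for the morphological analysis, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    reading: str            # katakana reading
    pos: str                # part-of-speech label (hypothetical values)
    is_index: bool = False  # index word identifier stored with the entry


def build_basic_dictionary(system_dict, user_dict):
    """Merge system and user dictionaries (assumed: user entries win)."""
    merged = dict(system_dict)
    merged.update(user_dict)
    return merged


def segment(text, dictionary):
    """Greedy longest-match segmentation returning (word, Entry) pairs.

    Characters not covered by the dictionary become one-character
    unknown words.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + n]
            if word in dictionary:
                out.append((word, dictionary[word]))
                i += n
                break
        else:
            out.append((text[i], Entry("", "unknown")))
            i += 1
    return out
```

Registering a compound such as 自然言語処理 in the user dictionary with `is_index=True` makes the segmenter return it as one word carrying the index flag, rather than as three separate system dictionary words.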

【００２５】自然言語処理では自然言語処理入力ファイ
ルを作成し、自然言語処理で基本辞書１１５　（システ
ム辞書１３１　＋ユーザ辞書１３２）を参照して、図３
に示すような入力原文データに対して第４図に示すよう
に分かち書き（品詞分解）及びカナ振りを行なう。分か
ち書きされたデータの直前にはその単語の品詞識別ＩＤ
が付加されており、単語の品詞を判別できるようになっ
ている。次に、自然言語処理された自然言語処理出力フ
ァイルに対して後処理を行なう（ステップＳ２０）。後
処理の詳細は図５に示すようになっており、先ず索引語
判別を行なう（ステップＳ２１）。すなわち、分かち書
き／カナ振りの行なわれたデータの品詞情報の索引語識
別ＩＤより、索引語の判別を行なう。そして、索引語抽
出を行なうが（ステップＳ２２）、これは判別された索
引語に記号の付加やアンダーラインの付加を行なうもの
である。例えば・記号付加［大日本印刷］株式会社は［ＡＩ］の一分野である［自
然言語処理］を利用して文書データベースのカナ振り、
キーワード抽出を自動的に行なうシステムを開発した。・アンダーライン（組版がズレない利点がある）（ここ
ではアンダーラインを“　　　　”で示す）“大日本印
刷”株式会社は“ＡＩ”の一分野である“自然言語処理
”を利用して文書データベースのカナ振り、キーワード
抽出を自動的に行なうシステムを開発した。In natural language processing, a natural language processing input file is created, and with reference to the basic dictionary 115 (system dictionary 131 + user dictionary 132), word segmentation (part-of-speech decomposition) and kana annotation are applied to input original text data such as that shown in FIG. 3, giving the result shown in FIG. 4. A part-of-speech identification ID is attached immediately before each segmented word so that the word's part of speech can be determined. Next, post-processing is performed on the natural language processing output file (step S20). The details of the post-processing are shown in FIG. 5. First, index word identification is performed (step S21): index words are identified from the index word identification ID in the part-of-speech information of the segmented and kana-annotated data. Then index word extraction is performed (step S22), which adds symbols or underlining to the identified index words. For example:
・Added symbols: [Dai Nippon Printing] Co., Ltd. has developed a system that uses [natural language processing], a field of [AI], to automatically perform kana annotation and keyword extraction for document databases.
・Underlining (which has the advantage of not shifting the typesetting; underlining is indicated here by quotation marks): "Dai Nippon Printing" Co., Ltd. has developed a system that uses "natural language processing", a field of "AI", to automatically perform kana annotation and keyword extraction for document databases.
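Steps S21 and S22 amount to scanning the segmented output and wrapping each word flagged as an index word in marker symbols. A minimal Python sketch, assuming the segmented data has already been reduced to `(surface, is_index)` pairs (a hypothetical representation), could look like this:

```python
def mark_index_words(tokens, open_mark="［", close_mark="］"):
    """Rebuild the text, wrapping every index word in marker symbols.

    tokens: list of (surface, is_index) pairs in sentence order.
    """
    return "".join(
        f"{open_mark}{surface}{close_mark}" if is_index else surface
        for surface, is_index in tokens
    )
```

Passing different `open_mark` and `close_mark` values gives the alternative marking styles; real underlining would instead be expressed with typesetting codes, which this sketch does not model.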

【００２６】この後にコード変換を行なう（ステップＳ
２３）。自然言語処理システムの処理結果はシステム固
有コードで出力されるので、ＣＴＳ　コードへのコード
変換を行ない（ステップＳ２３）、次にデータベースの
作成を行なう（ステップＳ２４）。つまり、索引語抽出
された本文データのデータベースへの登録を行なう。Code conversion is then performed (step S23). Because the processing results of the natural language processing system are output in the system-specific code, they are converted to CTS code (step S23), and the database is then created (step S24); that is, the text data from which index words have been extracted is registered in the database.

【００２７】基本辞書１１５　は自然言語処理（分かち
書き／カナ振り）を行なう上で一番基本となる辞書で、
システム辞書１３１　とユーザ辞書１３２　とから構成
されている。ユーザ辞書１３２　の修正を行なう事により、自然言語
処理の精度を向上する事が出来る。The basic dictionary 115 is the most fundamental dictionary for natural language processing (word segmentation and kana annotation) and consists of the system dictionary 131 and the user dictionary 132. The accuracy of natural language processing can be improved by modifying the user dictionary 132.

【００２８】この発明ではＣＴＳ　の自然言語処理の汎
用入出力ファイルとして汎用ファイル（以下、ＮＬファ
イルとする）を用いているが、ＮＬファイルでは図１３
に示すようにＮＬインファイル，ＮＬアウトファイル及
びＮＬ情報ファイルの３種類で構成され、フォーマット
は同一である。全体のフォーマットはヘダーレコード及
びデータレコードで成っており、ヘダーレコードにはレ
コード識別，シーケンス番号，ファイル識別，ジョブ名
，原稿名，ＣＴＳ　システム名等がある。また、データ
レコードとしてはレコード識別，シーケンス番号，デー
タ番号，処理種，データ等が含まれている。In this invention, a general-purpose file (hereinafter, NL file) is used as the general-purpose input/output file for CTS natural language processing. As shown in FIG. 13, there are three kinds of NL file, the NL in-file, the NL out-file, and the NL information file, and they share the same format. The overall format consists of a header record and data records. The header record contains the record identification, sequence number, file identification, job name, manuscript name, CTS system name, and so on. A data record contains the record identification, sequence number, data number, processing type, data, and so on.
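The NL file layout can be modeled with a few Python dataclasses. The field list follows the text, but the Python names, the types, and the helper `data_for` are assumptions; the patent gives no field widths or encodings, and joining the data of records that share a data number by simple concatenation is only one plausible reading.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HeaderRecord:
    record_id: str        # record identification
    sequence_no: int      # sequence number
    file_id: str          # file identification (in / out / information)
    job_name: str
    manuscript_name: str
    cts_system_name: str


@dataclass
class DataRecord:
    record_id: str        # record identification
    sequence_no: int      # sequence number
    data_no: int          # data number shared by records of one data item
    processing_type: int  # 1 to 4
    data: str


@dataclass
class NLFile:
    header: HeaderRecord
    records: List[DataRecord] = field(default_factory=list)

    def data_for(self, data_no):
        """Join the data of all records that share one data number."""
        return "".join(r.data for r in self.records if r.data_no == data_no)
```

Because the three NL files share one format, the same classes could represent the in-file, the out-file, and the information file, distinguished only by `file_id`.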

【００２９】入力ルーチンＳ１００は図１４に示すよう
に、ＮＬインファイルをパラメータと共に読込んで自然
言語処理入力ファイル及びＮＬ情報ファイルを作成する
ようになっており、その詳細は図１５に示すようになっ
ている。ＮＬインファイルを読込んで、パラメータの指
定によるファンクションの削除及びコード変換（外部→
システム固有コード）　を行ない、自然言語処理入力フ
ァイルを作成する。削除したファンクションの位置情報
及びコード変換情報は、情報ファイルに格納し、処理終
了後にジョブ名等をリスト出力する。パラメータチェッ
ク（ステップＳ１０１）　では、ファンクション削除実
行の有無及びコード変換情報の指示の解析を行なう。ヘ
ダーレコード作成（ステップＳ１０２）では、ＮＬイン
ファイルのヘダーレコードの内容より、自然言語処理入
力ファイル及びＮＬ情報ファイルのヘダーレコードを作
成する。同データＮＯのデータの読込２１（ステップＳ
１０３）　の処理は、同データＮＯを持つレコードの全
有効データを処理単位とする。従って、ＮＬインファイ
ルデータレコード中の同データＮＯを持つデータレコー
ドから有効データを抽出する。データの加工（ステップ
Ｓ１０４）では、ＮＬインファイルから抽出したデータ
のファンクションの削除及びコード変換を行なう。削除
したファンクションの情報及びコード変換情報はＮＬ情
報ファイルへ、処理されたデータは自然言語処理入力フ
ァイルに出力する。また、データレコードの作成（ステ
ップＳ１０５）　では、同データＮＯの加工後（ファン
クションの削除，コード変換）のデータを自然言語処理
入力ファイルへ出力し、加工情報をＮＬ情報ファイルへ
出力する。As shown in FIG. 14, the input routine S100 reads the NL in-file together with parameters and creates the natural language processing input file and the NL information file; its details are shown in FIG. 15. The NL in-file is read, functions are deleted as specified by the parameters, code conversion (external code → system-specific code) is performed, and the natural language processing input file is created. The position information of the deleted functions and the code conversion information are stored in the information file, and the job name and the like are output as a list after processing ends. The parameter check (step S101) analyzes whether function deletion is to be executed and the code conversion instructions. Header record creation (step S102) creates the header records of the natural language processing input file and the NL information file from the contents of the header record of the NL in-file. Reading of data with the same data number (step S103) treats all valid data of the records having the same data number as one processing unit; accordingly, valid data is extracted from the data records in the NL in-file that have the same data number. Data processing (step S104) deletes functions from the data extracted from the NL in-file and converts its code. Information on the deleted functions and the code conversion information is output to the NL information file, and the processed data is output to the natural language processing input file. Data record creation (step S105) outputs the processed data (after function deletion and code conversion) for the same data number to the natural language processing input file and outputs the processing information to the NL information file.
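The function deletion of step S104, together with the position information written to the NL information file, can be sketched as follows. The text does not say how functions are encoded, so this Python sketch assumes, purely for illustration, that they appear as angle-bracket tags such as `<B>`; the names `strip_functions` and `restore_functions` are hypothetical.

```python
import re

# Assumed encoding of a function code: any "<...>" tag.
FUNCTION_PATTERN = re.compile(r"<[^<>]*>")


def strip_functions(text):
    """Remove function codes and record where each one was.

    Returns (clean_text, removed), where removed is a list of
    (offset_in_clean_text, function_code) pairs in original order.
    """
    parts, removed = [], []
    last = clean_len = 0
    for m in FUNCTION_PATTERN.finditer(text):
        parts.append(text[last:m.start()])
        clean_len += m.start() - last
        removed.append((clean_len, m.group()))
        last = m.end()
    parts.append(text[last:])
    return "".join(parts), removed


def restore_functions(clean_text, removed):
    """Re-insert the recorded function codes at their recorded offsets."""
    out, last = [], 0
    for offset, code in removed:
        out.append(clean_text[last:offset])
        out.append(code)
        last = offset
    out.append(clean_text[last:])
    return "".join(out)
```

The round trip shown here only works on unchanged text. After segmentation slashes and readings have been added, the recorded offsets would need to be mapped onto the processed text, a step the source does not detail.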

【００３０】一方、図１３の出力ルーチンＳ２００は図
１６に示すように、自然言語処理の後処理として自然言
語処理出力ファイルとＮＬ情報ファイルを、パラメータ
と共に読込んでＮＬアウトファイルを作成するものであ
り、その詳細は図１７のようになっている。すなわち、
自然言語処理出力ファイルとＮＬ情報ファイルを読込ん
で、パラメータの指定によるファンクションの復帰及び
コード変換（　システム固有コード→外部）　を行ない
、ＮＬアウトファイルを作成する。処理終了後にジョブ
名等をリスト出力する。パラメータチェック（ステップＳ２０１）　では、ファ
ンクション復帰実行の有無及びコード変換情報の指示の
解析を行なう。ヘダーレコードの作成（ステップＳ２０
３）　では、ＮＬ情報ファイル及び自然言語処理出力フ
ァイルのヘダーレコードの内容よりＮＬアウトファイル
のヘダーレコードを作成する。同データＮＯのデータの
読込み（ステップＳ２０４）　は同データＮＯを持つレ
コードの全有効データを処理単位とする。自然言語処理
出力ファイルデータレコード中には、入力原文データと
処理結果データが存在するが、処理結果データのみを有
効データとする。従って、自然言語処理出力ファイルレ
コード中の同データＮＯを持つデータレコードから処理
結果データを抽出する。また、データの加工（ステップ
Ｓ２０５）　では、自然言語処理出力ファイルから抽出
したデータにファンクションの復帰及びコード変換を行
なう。加工したデータはＮＬアウトファイルに出力する
。On the other hand, as shown in FIG. 16, the output routine S200 of FIG. 13 performs post-processing after natural language processing: it reads the natural language processing output file and the NL information file together with parameters and creates the NL out-file; its details are shown in FIG. 17. That is, the natural language processing output file and the NL information file are read, functions are restored as specified by the parameters, code conversion (system-specific code → external code) is performed, and the NL out-file is created. The job name and the like are output as a list after processing ends. The parameter check (step S201) analyzes whether function restoration is to be executed and the code conversion instructions. Header record creation (step S203) creates the header record of the NL out-file from the contents of the header records of the NL information file and the natural language processing output file. Reading of data with the same data number (step S204) treats all valid data of the records having the same data number as one processing unit. The data records of the natural language processing output file contain both input original text data and processing result data, but only the processing result data is treated as valid data; accordingly, the processing result data is extracted from the data records in the natural language processing output file that have the same data number. Data processing (step S205) restores functions to the data extracted from the natural language processing output file and converts its code. The processed data is output to the NL out-file.

【００３１】この発明はＣＤ−ＲＯＭ等のデータベース
の構築支援として利用でき、検索用キーワードの抽出，
抽出したキーワードへの読みの付加を行ない得る。また
、印刷業務での利用が可能で、カナ振り機能を利用した
総ルビの印刷物作成や名簿の住所，氏名などの項目の自
動カナ振り，索引作成の支援システムとして利用できる
。[0031] This invention can be used to support the construction of databases such as CD-ROMs, by extracting search keywords and adding readings to the extracted keywords. It can also be used in printing work, as a support system for producing fully ruby-annotated printed matter using the kana annotation function, for automatic kana annotation of items such as addresses and names in directories, and for index creation.

【００３２】[0032]

【発明の効果】以上のようにこの発明の自然言語処理シ
ステムにより索引語抽出方法によれば、本文データから
自動的に索引語を抽出することができる。As described above, according to the index word extraction method using the natural language processing system of the present invention, index words can be automatically extracted from text data.

[Brief explanation of the drawing]

【図１】この発明の動作例を示すフローチャートである
。FIG. 1 is a flowchart showing an example of the operation of the present invention.

【図２】前処理の動作例を示すフローチャートである。FIG. 2 is a flowchart showing an example of preprocessing operation.

【図３】自然言語処理する原文の例を示す図である。FIG. 3 is a diagram showing an example of an original text subjected to natural language processing.

【図４】分かちカナの例を示す図である。FIG. 4 is a diagram showing an example of segmented text with kana annotation.

【図５】後処理の動作例を示すフローチャートである。FIG. 5 is a flowchart showing an example of post-processing operation.

【図６】自然言語処理システムのハードウエア構成例を
示すブロック図である。FIG. 6 is a block diagram showing an example of a hardware configuration of a natural language processing system.

【図７】そのソフトウエア構成例を示す図である。FIG. 7 is a diagram showing an example of the software configuration.

【図８】分かち書きの出力形式を示す図である。FIG. 8 is a diagram showing the output format of word segmentation.

【図９】分かち書きの出力形式を示す図である。FIG. 9 is a diagram showing the output format of word segmentation.

【図１０】分野別辞書を使用したカナ振りの出力形式を
示す図力形式を示す図である。FIG. 10 is a diagram showing the output format of kana annotation using a field-specific dictionary.

【図１１】総ルビ振りの出力形式を示す図である。FIG. 11 is a diagram showing the output format of full ruby annotation.

【図１２】キーワード抽出及びキーワードへのカナ振り
の出力形式を示す図である。FIG. 12 is a diagram showing the output format of keyword extraction and kana annotation of the keywords.

【図１３】この発明に用いる汎用ファイルの構成例を示
すフローチャートである。FIG. 13 is a flowchart showing an example of the configuration of a general-purpose file used in the present invention.

【図１４】入力ルーチンの入出力を示す図である。FIG. 14 is a diagram showing input and output of an input routine.

【図１５】入力ルーチンの詳細を示すフローチャートで
ある。FIG. 15 is a flowchart showing details of an input routine.

【図１６】出力ルーチンの入出力を示す図である。FIG. 16 is a diagram showing input and output of an output routine.

【図１７】出力ルーチンの詳細を示すフローチャートで
ある。FIG. 17 is a flowchart showing details of an output routine.

【図１８】基本辞書とユーザ辞書，システム辞書の関係
を示す図である。FIG. 18 is a diagram showing the relationship between a basic dictionary, a user dictionary, and a system dictionary.

【図１９】索引語を抽出する自然言語処理の例を示す図
である。FIG. 19 is a diagram illustrating an example of natural language processing for extracting index words.

[Explanation of symbols]

10 Host machine (ホストマシン)
11 CPU (ＣＰＵ)
12 Memory (メモリ)
14 Magnetic disk device (磁気デイスク装置)
15 Cassette magnetic tape device (カセット磁気テープ装置)
20 Magnetic tape device (磁気テープ装置)
21 Laser printer (レーザープリンタ)
22 Confirmation/correction terminal (確認／修正用端末)
23 Console terminal (コンソール端末)

Claims

[Claims]

[Claim 1] A method for extracting index words using a natural language processing system, characterized in that, after predetermined index words are registered in a user dictionary, word segmentation and kana annotation are performed by natural language processing on the text data from which index words are to be extracted, and the index words are extracted by referring to the part-of-speech information of each word.