JPH05128159A

JPH05128159A - Keyword extraction method and device

Info

Publication number: JPH05128159A
Application number: JP3291223A
Authority: JP
Inventors: Shiyou Imasato; 詔今郷
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1991-11-07
Filing date: 1991-11-07
Publication date: 1993-05-25

Abstract

PURPOSE: To improve the accuracy of keyword extraction by identifying the type of document information whose content character string is divided in advance into a plurality of logical elements, detecting the logical elements that match logical elements stored in advance for each type, and extracting keywords from the content character strings within the detected logical elements.

CONSTITUTION: A keyword extracting means 3 is part of a document retrieval device 2, which is connected to a database and the like (not shown). The document information has a structure in which the content character string is divided in advance into a plurality of logical elements. A document content identifying means 5 separates the document information into logical elements and content character strings, for example by analysis with a context-free grammar, and detects the correspondence between them. The keyword extracting means 3 sequentially reads the logical elements stored in a specific logical element name table 4, detects matches with the logical elements separated by the means 5, and extracts keywords from the content character strings within the matched logical elements.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a keyword extraction method and device for use in document filing systems and the like.

[0002]

[Prior Art] In a conventional document retrieval device, when document information is registered, an operator selects the keywords considered appropriate and classifies them using a thesaurus. However, updating the thesaurus whenever such keywords are added or deleted is cumbersome, which increases the burden on the operator.

[0003] To solve this problem, keyword extracting devices have been developed that extract keywords from the content character string of document information by mechanical information processing such as morphological analysis. Such a device has, for example, a keyword extracting means that extracts keywords from the content character string of input document information, and it extracts keywords from document information formed as a text file. If, for example, the correspondence between the extracted keywords and the document information is registered in the inverted file of a document retrieval device, the desired document information can later be retrieved from the keywords in that inverted file.

[0004] Keyword extraction methods of two kinds have been proposed: those that do not assume the type of the document information in advance and those that do. A method that does not assume the type in advance can extract keywords from any document information formed in text file format. As an example of a method that does assume the type in advance, the keyword extraction method proposed in Haruo Kimoto, "Automatic keyword extraction using language processing," 1st National Conference of the Japanese Society for Artificial Intelligence (1987), assumes that the document information is a newspaper article or the like and extracts keywords by relying on the characteristics of that document structure.

[0005]

[Problems to Be Solved by the Invention] As described above, a keyword extraction method that does not assume the type of document information in advance can extract keywords from document information in text file format, but it is difficult to improve its extraction accuracy. A method that does assume the type in advance achieves good extraction accuracy, but its versatility is reduced because the document information from which keywords are extracted is limited to a specific type.

[0006]

[Means for Solving the Problems] The invention of claim 1 is a keyword extraction method in which a keyword extracting means extracts keywords from input document information, wherein a type identifying means identifies the type of document information whose content character string is divided in advance into a plurality of logical elements; an element detecting means detects, from the document information, logical elements that match the logical elements stored in advance in an element storage means for each identified type of document information; and the keyword extracting means extracts keywords from the content character strings within the detected logical elements.

[0007] The invention of claim 2 is a keyword extraction method in which a keyword extracting means extracts keywords from input document information, wherein a type identifying means identifies the type of document information whose content character string is divided in advance into a plurality of logical elements; an element detecting means detects, from the document information, logical elements that do not match the logical elements stored in advance in an element storage means for each identified type of document information; and the keyword extracting means extracts keywords from the content character strings within the detected logical elements.

[0008] In the invention of claim 3, an importance calculating means calculates the importance of each keyword that the keyword extracting means has extracted from the content character string of the document information, and an importance updating means updates the importance of the keywords extracted from the document information within the logical elements detected by the element detecting means.

[0009] The invention of claim 4 is a keyword extracting device provided with a keyword extracting means that extracts keywords from the content character string of input document information, the device comprising: a type identifying means that identifies the type of document information whose content character string is divided in advance into a plurality of logical elements; an element storage means that stores specific logical elements in advance for each type of document information identified by the type identifying means; an element detecting means that detects, from the document information identified by the type identifying means, logical elements that match the logical elements stored in the element storage means; and a keyword extracting means that extracts keywords from the content character strings within the logical elements detected by the element detecting means.

[0010] The invention of claim 5 is a keyword extracting device provided with a keyword extracting means that extracts keywords from the content character string of input document information, the device comprising: a type identifying means that identifies the type of document information whose content character string is divided in advance into a plurality of logical elements; an element storage means that stores specific logical elements in advance for each type of document information identified by the type identifying means; an element detecting means that detects, from the document information identified by the type identifying means, logical elements that do not match the logical elements stored in the element storage means; and a keyword extracting means that extracts keywords from the content character strings within the logical elements detected by the element detecting means.

[0011] The invention of claim 6 is provided with an importance calculating means that calculates the importance of each keyword that the keyword extracting means has extracted from the content character string of the document information, and an importance updating means that updates the importance of the keywords extracted from the document information within the logical elements detected by the element detecting means.

[0012]

[Operation] The inventions of claims 1 and 4 can extract keywords from the content character strings that are important in terms of logical structure for each type of document information, which improves keyword extraction accuracy. Because a plurality of types can be set for the document information from which such keywords are extracted, the versatility with respect to the document information targeted for keyword extraction is also improved.

[0013] The inventions of claims 2 and 5 can extract keywords from the content character strings other than the portions that are unimportant in terms of logical structure for each type of document information, which improves keyword extraction accuracy. Because a plurality of types can be set for the document information from which such keywords are extracted, the versatility with respect to the document information targeted for keyword extraction is also improved.

[0014] The inventions of claims 3 and 6 can correct the importance of a keyword according to the importance of its position in the logical structure of the document information, which improves retrieval accuracy when document information is retrieved on the basis of keyword importance.

[0015]

[Embodiment] An embodiment of the inventions of claims 1 and 4 is described with reference to the drawings. As illustrated in the block diagram of FIG. 2, the keyword extracting device 1 of this example is formed as part of a document retrieval device 2, which is connected to, among other things, a database (not shown) storing a large number of items of document information. As illustrated in FIG. 3, the document information from which the keyword extracting device 1 extracts keywords has a structure in which the content character string is divided in advance into a plurality of logical elements. Here, the logical elements marking the start and end of each division are written as <title>, </title>, and so on, and enclose the content character strings. The logical elements are arranged hierarchically, and the name of the topmost logical element, here <report>, expresses the document type, that is, the type of the document information. A specific means of attaching logical elements to the content character string of document information in this way is disclosed in, for example, ISO 8879, Information processing - Text and office systems - Standard Generalized Markup Language (SGML).

[0016] Accordingly, the document retrieval device 2 of this embodiment is built by connecting an inverted file updating means 6 and the like to the keyword extracting device 1, which in turn has a structure in which a specific logical element name table 4, a document content identifying means 5, a type identifying means (not shown), and the like are connected to a keyword extracting means 3.

[0017] More specifically, the document content identifying means 5 separates the document information into logical elements and content character strings and detects the correspondence between them, for example by analysis with a context-free grammar that uses the form of the logical elements illustrated in FIG. 4. The type identifying means identifies the document type, that is, the type of the document information, from information such as the name of the topmost logical element. The specific logical element name table 4, which serves as the element storage means, stores specific logical elements by name in advance for each document type, as illustrated in FIG. 5. This table need only be created once in advance; it does not have to be created each time document information is registered.
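The separation and lookup described in this paragraph can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the patent: it assumes tags without attributes and with explicit end tags (real SGML permits tag omission and needs a proper parser), and the table contents and the names `SPECIFIC_ELEMENTS` and `split_elements` are hypothetical.

```python
import re

# Hypothetical contents of the specific logical element name table:
# document type -> names of the logical elements registered for that type.
SPECIFIC_ELEMENTS = {
    "report": ["title", "abstract"],
    "letter": ["subject"],
}

# Simplified tag pattern: <name> or </name>, no attributes (an assumption).
TAG = re.compile(r"<(/?)([A-Za-z][\w-]*)>")


def split_elements(text):
    """Return (document_type, [(element_name, content_string), ...])."""
    stack, pairs, doc_type, pos = [], [], None, 0
    for m in TAG.finditer(text):
        content = text[pos:m.start()].strip()
        if content and stack:
            # Content belongs to the innermost element that is still open.
            pairs.append((stack[-1], content))
        closing, name = m.group(1), m.group(2)
        if closing:
            if not stack or stack[-1] != name:
                raise ValueError(f"unbalanced tag </{name}>")
            stack.pop()
        else:
            if doc_type is None:
                doc_type = name  # the topmost element names the document type
            stack.append(name)
        pos = m.end()
    if stack:
        raise ValueError(f"unclosed element <{stack[-1]}>")
    return doc_type, pairs
```

For an input such as `<report><title>Keyword extraction</title><body>Some text</body></report>`, `split_elements` reports `report` as the document type, and `SPECIFIC_ELEMENTS["report"]` then gives the element names to look for. Because the table is a plain lookup keyed by document type, it can be prepared once and reused for every registered document.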

[0018] The keyword extracting means 3, which also serves as the element detecting means, sequentially reads the specific logical elements stored in the specific logical element name table 4, detects matches with the logical elements of the document information separated by the document content identifying means 5, and extracts keywords from the content character strings within the logical elements detected in this way. The keyword extraction itself can be carried out easily with known techniques such as detecting nouns by morphological analysis.

[0019] The inverted file updating means 6 connected to the keyword extracting device 1 described above updates, with the extracted keywords, the stored information of an inverted file (not shown) connected to it. The stored information of this inverted file consists of an index or the like from which the document information corresponding to a keyword can be uniquely located.
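As an illustration of the inverted (transposed) file described here, the following Python sketch keeps a mapping from each keyword to the identifiers of the documents that contain it. The class name `InvertedFile`, its methods, and the use of string document identifiers are assumptions made for this example; the patent does not specify the index layout.

```python
from collections import defaultdict


class InvertedFile:
    """Keyword -> set of document identifiers (hypothetical layout)."""

    def __init__(self):
        self._index = defaultdict(set)

    def update(self, doc_id, keywords):
        # Register the document under every keyword extracted from it.
        for keyword in keywords:
            self._index[keyword].add(doc_id)

    def search(self, keyword):
        # Return the documents registered under the keyword, in sorted order.
        return sorted(self._index.get(keyword, ()))
```

A call such as `index.update("doc1", ["keyword", "retrieval"])` plays the role of the updating means 6, and `index.search("keyword")` corresponds to a later retrieval by keyword.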

[0020] The processing performed by the keyword extracting device 1 in this configuration is described in detail below, together with that of the document retrieval device 2, with reference to the flowchart of FIG. 1. When document information having the logical structure described above is input to the keyword extracting device 1 of the document retrieval device 2, the type identifying means detects the topmost logical element and reads its name, thereby identifying the document type. According to the identified document type, the keyword extracting means 3 sequentially reads the corresponding logical elements by name from the specific logical element name table 4. Meanwhile, the document content identifying means 5 extracts the logical elements from the document information. The keyword extracting means 3 then detects matches between the logical elements read from the specific logical element name table 4 and those extracted by the document content identifying means 5, and extracts keywords from the content character strings within the matched logical elements.

[0021] In practice, the keyword extracting means 3 first selects keyword candidates by extracting nouns through morphological analysis from all the content character strings that the document content identifying means 5 has separated from the document information. Only those keyword candidates located within logical elements whose names match the names read from the specific logical element name table 4 are then selected as keywords.
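The two-phase procedure described above (select candidates everywhere, then keep only those in matching elements) can be sketched in Python as follows. This is an illustration under assumptions: the function names, the table contents, and the stopword list are hypothetical, and `noun_candidates` is only a crude stand-in for noun extraction by morphological analysis, which would need a real analyzer.

```python
import re

# Hypothetical contents of the specific logical element name table.
SPECIFIC_ELEMENTS = {"report": ["title", "abstract"]}

# Assumed stopword list; a real system would use a morphological analyzer.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}


def noun_candidates(text):
    """Crude stand-in for noun extraction by morphological analysis."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def extract_keywords(doc_type, elements):
    """elements: list of (element_name, content_string) pairs."""
    # Phase 1: pick keyword candidates from every content string.
    candidates = [(name, noun_candidates(content)) for name, content in elements]

    # Phase 2: read the table entries in order and keep only the candidates
    # that sit in an element whose name matches.
    keywords = []
    for wanted in SPECIFIC_ELEMENTS.get(doc_type, []):
        for name, words in candidates:
            if name == wanted:
                for word in words:
                    if word not in keywords:
                        keywords.append(word)
    return keywords
```

With a `report` whose `title` and `abstract` elements are registered in the table, words found only in an unregistered element such as `body` are dropped even though they were selected as candidates in the first phase.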

[0022] In the document retrieval device 2, the inverted file updating means 6 updates the stored information of the inverted file with the keywords extracted by the keyword extracting device 1 as described above.

[0023] In this way, the keyword extracting device 1 of the document retrieval device 2 can define, in terms of logical structure, the portions from which keywords are extracted for each type of document information, so keyword extraction accuracy is very good. Because a plurality of types can be set for the document information from which keywords are extracted, versatility is also improved. Since the inverted file is updated with the keywords extracted in this way, the document retrieval device 2 also achieves very good retrieval accuracy for document information.

[0024] This embodiment, as an embodiment of the inventions of claims 1 and 4, extracts keywords from the logical elements stored in the specific logical element name table 4, thereby specifying the portions best suited to keyword extraction and obtaining good keywords. Alternatively, as in the inventions of claims 2 and 5, keywords may be extracted from the logical elements that are not stored in the specific logical element name table 4, thereby excluding the portions unsuited to keyword extraction and preventing the extraction of unnecessary keywords.
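The two alternatives in this paragraph differ only in whether the table lists the elements to use or the elements to skip. A minimal Python sketch, assuming the document has already been separated into (element name, content) pairs; the function name `select_elements` and the `mode` parameter are hypothetical:

```python
def select_elements(elements, table_names, mode="include"):
    """Choose the elements from which keywords will be extracted.

    mode="include": keep elements whose names match the table (claims 1 and 4).
    mode="exclude": keep elements whose names do not match it (claims 2 and 5).
    """
    names = set(table_names)
    if mode == "include":
        return [(name, content) for name, content in elements if name in names]
    if mode == "exclude":
        return [(name, content) for name, content in elements if name not in names]
    raise ValueError(f"unknown mode: {mode!r}")
```

In the exclusion mode the table would hold element names judged unsuitable for extraction, for instance a hypothetical `references` element, and everything else is passed on to keyword extraction.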

[0025] Furthermore, as in the inventions of claims 3 and 6, it is also possible to provide an importance calculating means (not shown) that calculates the importance of each keyword extracted by the keyword extracting means from the content character string of the document information, and an importance updating means (not shown) that updates the importance of the keywords extracted from the document information within the logical elements detected by the element detecting means. In this case, the importance calculating means calculates the importance of a keyword with a formula such as (count of that keyword) / (count of all extracted keywords), and the importance updating means updates the importance by, for example, multiplying it by a preset constant. Such an update can, for example, increase the importance of keywords extracted from the logical elements stored in the specific logical element name table 4, raising the importance of keywords suited to retrieving the document information, or decrease the importance of keywords extracted from logical elements not stored in the specific logical element name table 4, lowering the importance of keywords unsuited to retrieval. This improves retrieval accuracy when document information is retrieved on the basis of keyword importance.
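The importance calculation and update can be illustrated with a short Python sketch. The ratio follows the formula given in the text; everything else is an assumption made for the example: the function name, the constants `boost` and `damp`, applying both the increase and the decrease in one pass, and the rule that a keyword counts as favored if it occurs in at least one element listed in the table.

```python
from collections import Counter


def keyword_importance(elements, table_names, boost=2.0, damp=0.5):
    """elements: list of (element_name, [keywords extracted from that element])."""
    counts = Counter(kw for _, kws in elements for kw in kws)
    total = sum(counts.values())
    if total == 0:
        return {}

    # Importance calculation: (count of the keyword) / (count of all keywords).
    importance = {kw: n / total for kw, n in counts.items()}

    # Keywords that occur in at least one element listed in the table.
    listed = set(table_names)
    favored = {kw for name, kws in elements if name in listed for kw in kws}

    # Importance update: multiply by a preset constant (assumed values).
    return {kw: v * (boost if kw in favored else damp) for kw, v in importance.items()}
```

For example, with keywords `keyword` and `extraction` found in `title`, and `keyword` and `weather` found in `body`, registering only `title` in the table raises the scores of the first two and lowers that of `weather`. The resulting scores are relative weights rather than probabilities, since the multiplication can push them above 1.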

[0026]

[Effects of the Invention] The invention of claim 1 is a keyword extraction method in which a keyword extracting means extracts keywords from input document information, wherein a type identifying means identifies the type of document information whose content character string is divided in advance into a plurality of logical elements, an element detecting means detects from the document information logical elements that match the logical elements stored in advance in an element storage means for each identified type, and the keyword extracting means extracts keywords from the content character strings within the detected logical elements. Keywords can therefore be extracted from the content character strings that are important in terms of logical structure for each type of document information, which improves keyword extraction accuracy. In addition, because a plurality of types can be set for the document information from which such keywords are extracted, the versatility with respect to the document information targeted for keyword extraction is improved.

[0027] The invention of claim 2 is a keyword extraction method in which a keyword extracting means extracts keywords from input document information, wherein a type identifying means identifies the type of document information whose content character string is divided in advance into a plurality of logical elements, an element detecting means detects from the document information logical elements that do not match the logical elements stored in advance in an element storage means for each identified type, and the keyword extracting means extracts keywords from the content character strings within the detected logical elements. Keywords can therefore be extracted from the content character strings other than the portions that are unimportant in terms of logical structure for each type of document information, which improves keyword extraction accuracy. In addition, because a plurality of types can be set for the document information from which such keywords are extracted, the versatility with respect to the document information targeted for keyword extraction is improved.

[0028] In the invention of claim 3, an importance calculating means calculates the importance of each keyword that the keyword extracting means has extracted from the content character string of the document information, and an importance updating means updates the importance of the keywords extracted from the document information within the logical elements detected by the element detecting means. The importance of a keyword can therefore be corrected according to the importance of its position in the logical structure of the document information, which improves retrieval accuracy when document information is retrieved on the basis of keyword importance.

[0029] The invention of claim 4 is a keyword extracting device provided with a keyword extracting means that extracts keywords from the content character string of input document information, the device comprising a type identifying means that identifies the type of document information whose content character string is divided in advance into a plurality of logical elements, an element storage means that stores specific logical elements in advance for each type of document information identified by the type identifying means, an element detecting means that detects, from the document information identified by the type identifying means, logical elements that match the logical elements stored in the element storage means, and a keyword extracting means that extracts keywords from the content character strings within the logical elements detected by the element detecting means. Keywords can therefore be extracted from the content character strings that are important in terms of logical structure for each type of document information, which improves keyword extraction accuracy. In addition, because a plurality of types can be set for the document information from which such keywords are extracted, the versatility with respect to the document information targeted for keyword extraction is improved.

[0030] The invention of claim 5 is a keyword extracting device provided with a keyword extracting means that extracts keywords from the content character string of input document information, the device comprising a type identifying means that identifies the type of document information whose content character string is divided in advance into a plurality of logical elements, an element storage means that stores specific logical elements in advance for each type of document information identified by the type identifying means, an element detecting means that detects, from the document information identified by the type identifying means, logical elements that do not match the logical elements stored in the element storage means, and a keyword extracting means that extracts keywords from the content character strings within the logical elements detected by the element detecting means. Keywords can therefore be extracted from the content character strings other than the portions that are unimportant in terms of logical structure for each type of document information, which improves keyword extraction accuracy. In addition, because a plurality of types can be set for the document information from which such keywords are extracted, the versatility with respect to the document information targeted for keyword extraction is improved.

[0031] The invention of claim 6 is provided with an importance calculating means that calculates the importance of each keyword that the keyword extracting means has extracted from the content character string of the document information, and an importance updating means that updates the importance of the keywords extracted from the document information within the logical elements detected by the element detecting means. The importance of a keyword can therefore be corrected according to the importance of its position in the logical structure of the document information, which improves retrieval accuracy when document information is retrieved on the basis of keyword importance.

[Brief description of drawings]

【図１】本発明の実施例を示すフローチャートである。FIG. 1 is a flow chart showing an embodiment of the present invention.

【図２】キーワード抽出装置を一部とする文書検索装置
を示すブロック図である。FIG. 2 is a block diagram showing a document search device including a keyword extraction device as a part.

【図３】文書情報の論理構造を示す概念説明図である。FIG. 3 is a conceptual explanatory diagram showing a logical structure of document information.

【図４】文脈自由文法の論理構造を示す概念説明図であ
る。FIG. 4 is a conceptual explanatory diagram showing the logical structure of a context-free grammar.

【図５】特定論理要素名テーブルの構造を示す概念説明
図である。FIG. 5 is a conceptual explanatory diagram showing the structure of a specific logical element name table.

[Explanation of symbols]

１ キーワード抽出装置
３ キーワード抽出手段かつ要素検出手段
４ 要素記憶手段

1 Keyword extracting device
3 Keyword extracting means and element detecting means
4 Element storing means

Claims

[Claims]

1. A keyword extracting method in which keyword extracting means extracts a keyword from input document information, wherein type identifying means identifies the type of document information whose content character string is divided in advance into a plurality of logical elements; element detecting means detects, from the document information, a logical element that matches a logical element stored in advance in element storing means for each type of the identified document information; and the keyword extracting means extracts a keyword from the content character string in the detected logical element.

2. A keyword extracting method in which keyword extracting means extracts a keyword from input document information, wherein type identifying means identifies the type of document information whose content character string is divided in advance into a plurality of logical elements; element detecting means detects, from the document information, a logical element that does not match a logical element stored in advance in element storing means for each type of the identified document information; and the keyword extracting means extracts a keyword from the content character string in the detected logical element.

3. The keyword extracting method described, wherein importance calculating means calculates the importance of the keyword that the keyword extracting means extracts from the content character string of the document information, and importance updating means updates the importance of the keyword extracted from the document information in the logical element detected by the element detecting means.

4. A keyword extracting device provided with keyword extracting means for extracting a keyword from the content character string of input document information, comprising: type identifying means for identifying the type of document information whose content character string is divided in advance into a plurality of logical elements; element storing means for storing specific logical elements in advance for each type of document information identified by the type identifying means; and element detecting means for detecting, from the document information identified by the type identifying means, a logical element that matches a logical element stored in the element storing means, wherein the keyword extracting means extracts a keyword from the content character string in the logical element detected by the element detecting means.

5. A keyword extracting device provided with keyword extracting means for extracting a keyword from the content character string of input document information, comprising: type identifying means for identifying the type of document information whose content character string is divided in advance into a plurality of logical elements; element storing means for storing specific logical elements in advance for each type of document information identified by the type identifying means; and element detecting means for detecting, from the document information identified by the type identifying means, a logical element that does not match a logical element stored in the element storing means, wherein the keyword extracting means extracts a keyword from the content character string in the logical element detected by the element detecting means.

6. The keyword extracting device according to claim 4, further comprising: importance calculating means for calculating the importance of the keyword that the keyword extracting means extracts from the content character string of the document information; and importance updating means for updating the importance of the keyword extracted from the document information in the logical element detected by the element detecting means.