JPH03122768A

JPH03122768A - Indexing supporting system

Info

Publication number: JPH03122768A
Application number: JP1260694A
Authority: JP
Inventors: Tetsuya Morita; 哲也森田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1989-10-05
Filing date: 1989-10-05
Publication date: 1991-05-24

Abstract

PURPOSE: To improve retrieval recall by providing an indexing support device with a thesaurus that holds relation information among a plurality of index terms, and with a related index term search unit that retrieves index terms related to the index terms extracted from a document. CONSTITUTION: When index terms are selected in the result display/selection unit 3, the related index terms retrieved from the thesaurus 5 by the related index term search unit 6 are displayed and offered for selection together with the index terms extracted from the document to be registered. Even a term that does not appear in the document can therefore be assigned to it as an index term, where related, and entered in the database. Because the user can express the concept of the document more appropriately, the choice of index terms available when searching for the document widens and retrieval recall improves.

Description

[Detailed Description of the Invention]

Industrial Field of Application

The present invention relates to an indexing support device in which a user selects, from among index term candidates extracted from a document to be registered in a database, those considered appropriate as index terms, and registers them in the database together with the document.

Prior Art

In general, research on automatic indexing of this kind aims to assign to each document, as index terms, word strings that express the content of a document or document set well and that discriminate sufficiently between documents or document sets.

For example, as described in the reference "Trends in Automatic Indexing Research" (Journal of the Information Processing Society of Japan, Vol. 25, No. 9, 1984) and the reference "Automatic Extraction of Important Words in Japanese Documents" (Journal of the Information Processing Society of Japan, Vol. 17, No. 2, 1976), systems such as IBM's STAIRS, the machine-aided indexing of the U.S. DDC, JICST's JAKAS, and Kyoto University's SMART system cut words out of a document, apply stop-word removal, grammatical rules, and the like, and present several index term candidates to the user.

FIG. 2 shows the system configuration of such a conventional automatic index extraction device. A document file 1 is analyzed by an automatic index extraction unit 2, which automatically extracts index term candidates (extracted index terms). The results are displayed to the user by a result display/selection unit 3, and the registration operator selects from the displayed candidates the index terms considered appropriate. Each document is indexed in this way, and an index file 4 is created as the database.

Problems to Be Solved by the Invention

In these conventional systems, however, the only terms that can be assigned as index terms to a document being registered are terms that appear in that document. Strictly speaking, the handling of spelling variants and synonyms may cause a form that differs from the candidate found in the document to be assigned, but such forms are recognized and registered as the same term. In essence, therefore, only terms occurring in the document are assigned as index terms.

With index terms assigned in this limited way, a searcher must enter exactly these index terms at retrieval time. This is one reason why retrieval recall tends to be low and relevant documents are frequently missed.

Means for Solving the Problems

The invention of claim 1 comprises a thesaurus that holds index terms and relation information between the index terms; a related index term search unit that retrieves from the thesaurus the related index terms that are related to the extracted index terms extracted from a document to be registered in a database; and a result display/selection unit. The extracted index terms and the retrieved related index terms are displayed on the result display/selection unit, and the index terms selected there are stored in the database together with the document to be registered.

In the invention of claim 2, the related index term search unit of claim 1 is replaced by a related index term search unit that retrieves from the thesaurus the related index terms that are related to the extracted index terms extracted from the document to be registered in the database, calculates the importance of each related index term with respect to the document by a predetermined formula with reference to the thesaurus, and creates a list of the related index terms in order of relevance. The list of extracted index terms and the relevance-ordered list of related index terms are displayed on the result display/selection unit, and the index terms selected there are stored in the database together with the document to be registered.

Operation

When index terms are selected in the result display/selection unit, the related index terms retrieved from the thesaurus by the related index term search unit are displayed and offered for selection together with the index terms extracted from the document to be registered. Even a term that does not appear in the document can therefore be assigned to it as an index term, where related, and entered in the database. The user can thus express the concept of the document more appropriately, the choice of index terms available when searching for documents widens, and retrieval recall improves.

In particular, the importance of the related index terms is calculated with reference to the thesaurus, a list of the related index terms in order of relevance is created, and this list is displayed together with the list of extracted index terms and offered for selection. Strongly related index terms can therefore be registered without omission, and more appropriate index terms can be assigned. As a result, documents that the conventional method would have missed because of incomplete search conditions also become retrievable.

Embodiment

An embodiment of the present invention will be described with reference to FIG. 1.

Parts identical to those shown in FIG. 2 are given the same reference numerals, and their description is omitted. In this embodiment, in addition to the system configuration of FIG. 2, a thesaurus 5 is provided that holds all existing index terms together with relation information between the index terms. Here, the relation information between index terms is information on, for example, broader terms, narrower terms, and related terms. Also provided is a related index term search unit 6 that refers to the thesaurus 5 and retrieves related index terms for the index term candidates (extracted index terms) obtained when the automatic index extraction unit 2 analyzes a document file.
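The thesaurus and the related-term lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class `Thesaurus`, the function `find_related_terms`, the relation labels, and the sample terms are all hypothetical and are not specified in the patent.

```python
from collections import defaultdict

# Relation types named in the text: broader term, narrower term, related term.
BROADER, NARROWER, RELATED = "broader", "narrower", "related"


class Thesaurus:
    """Holds index terms and typed relations between them (hypothetical layout)."""

    def __init__(self):
        # term -> {other term: relation type of the other term as seen from `term`}
        self._relations = defaultdict(dict)

    def add_broader(self, term, broader_term):
        # Record both directions so a lookup works from either term.
        self._relations[term][broader_term] = BROADER
        self._relations[broader_term][term] = NARROWER

    def add_related(self, term_a, term_b):
        self._relations[term_a][term_b] = RELATED
        self._relations[term_b][term_a] = RELATED

    def relations_of(self, term):
        """Return {other term: relation type} for one term (empty if unknown)."""
        return dict(self._relations.get(term, {}))


def find_related_terms(thesaurus, extracted_terms):
    """Collect terms related to the extracted terms, excluding the extracted ones."""
    extracted = set(extracted_terms)
    related = set()
    for term in extracted:
        related.update(thesaurus.relations_of(term))
    return related - extracted


# Sample data, invented for illustration only.
thesaurus = Thesaurus()
thesaurus.add_broader("thesaurus", "controlled vocabulary")
thesaurus.add_related("thesaurus", "indexing")
thesaurus.add_related("indexing", "information retrieval")

candidates = find_related_terms(thesaurus, ["thesaurus", "indexing"])
print(sorted(candidates))
```

Storing each relation in both directions is a design choice of this sketch; it lets the lookup start from whichever term happens to be extracted from the document.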

In this configuration, when a document to be registered is input to the automatic index extraction unit 2, morphological analysis is performed and each sentence is divided into words. Spelling variants are removed from these words, synonyms are converted to a unified notation, and unnecessary words are removed, producing the index term candidates (extracted index terms). Next, the related index term search unit 6 refers to the thesaurus 5 and retrieves the related index terms for the extracted index terms. After the retrieval, the importance of these related index terms with respect to the document to be registered is calculated by a predetermined formula.
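The candidate extraction step can be illustrated with a minimal Python sketch. It assumes English text and replaces morphological analysis with simple regular-expression tokenization; the tables `STOP_WORDS` and `CANONICAL_FORM` and the function `extract_index_terms` are hypothetical stand-ins, not part of the patent.

```python
import re

# Hypothetical resources; a real system would use a morphological analyzer
# and curated dictionaries instead of these toy tables.
STOP_WORDS = {"the", "a", "of", "and", "for", "is", "in", "to"}
CANONICAL_FORM = {
    "data-base": "database",   # spelling variant
    "db": "database",          # synonym unified to one notation
    "indices": "index",
}


def extract_index_terms(text):
    """Return candidate index terms in order of first appearance."""
    # Stand-in for morphological analysis: split into lowercase word tokens.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    candidates = []
    for token in tokens:
        term = CANONICAL_FORM.get(token, token)   # unify variants and synonyms
        if term in STOP_WORDS:                    # drop unnecessary words
            continue
        if term not in candidates:                # keep each candidate once
            candidates.append(term)
    return candidates


print(extract_index_terms("The DB holds indices for the data-base of documents"))
```

For Japanese text, the tokenization line would have to be replaced by a real morphological analyzer; the normalization and stop-word steps would keep the same shape.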

The importance can be obtained, for example, by the following formula.

First, the importance takes a value in the interval [0, 1], where 0 indicates no relation and 1 indicates the strongest relation.

Here, an index term extracted from the document $d$ to be registered is given an importance of 1. For an index term $i$ that does not occur in document $d$, the importance $R_d(i)$ is defined as follows. Let $N$ be the total number of index terms extracted from document $d$, $A_d(i)$ the set of index terms $j$ with which index term $i$ has the above relation information in the thesaurus 5, $N_d(j)$ the frequency of occurrence of index term $j$ in document $d$, and $N_{\mathrm{all}}$ the total frequency of occurrence of index terms in document $d$. Then

$$R_d(i) = \sum_{j \in A_d(i)} \frac{N_d(j)}{N_{\mathrm{all}}}$$

A virtual in-document frequency can also be defined according to the type of relation (broader, narrower, related, and so on). For example, the frequency $N_d(j)$ is multiplied by 0.9 for a broader-term relation, 0.3 for a narrower-term relation, and 1.0 for a related-term relation. The resulting value $N'_d(j)$ is substituted for $N_d(j)$ in the above formula, and the sum $N'_{\mathrm{all}}$ of the values $N'_d(j)$ is substituted for $N_{\mathrm{all}}$.
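The importance calculation can be sketched in Python as below. The function name `importance`, its arguments, and the sample frequencies are hypothetical. One point is an assumption of this sketch rather than something the text states: in the weighted variant, index terms that have no relation to the candidate term enter the total with their unweighted frequency, which keeps the result within [0, 1] for weights of at most 1.

```python
# Example relation weights quoted in the text.
RELATION_WEIGHT = {"broader": 0.9, "narrower": 0.3, "related": 1.0}


def importance(term, doc_freq, relations, weighted=False):
    """Importance of `term` for one document.

    doc_freq:  {index term j: occurrence count of j in the document}
    relations: {index term j: relation type linking `term` and j in the thesaurus}
    """
    if term in doc_freq:
        return 1.0  # terms extracted from the document itself score 1
    numerator = 0.0
    denominator = 0.0
    for j, count in doc_freq.items():
        relation = relations.get(j)
        if weighted and relation is not None:
            count = RELATION_WEIGHT[relation] * count  # virtual frequency
        if relation is not None:
            numerator += count  # j belongs to the related set of `term`
        # Assumption: unrelated terms are counted with their raw frequency.
        denominator += count
    return numerator / denominator if denominator else 0.0


# Sample data, invented for illustration only.
doc_freq = {"thesaurus": 4, "indexing": 3, "database": 3}
relations = {"thesaurus": "narrower", "indexing": "related"}

plain = importance("information retrieval", doc_freq, relations)
weighted = importance("information retrieval", doc_freq, relations, weighted=True)
print(round(plain, 3), round(weighted, 3))
```

In the unweighted case the sample works out to (4 + 3) / 10, since two of the three document terms are related to the candidate term.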

After the importance is calculated in this way, a list of the related index terms in order of relevance is created. The result display/selection unit 3 displays this relevance-ordered list to the user together with the list of extracted index terms and offers both for selection.
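The final step, building the relevance-ordered list and storing the selected terms, might look like the following Python sketch. The functions `build_selection_lists` and `register`, the dictionary used as the index file, and the scores are hypothetical; the patent does not specify how ties are ordered or how the index file is laid out.

```python
def build_selection_lists(extracted_terms, related_scores):
    """Return (extracted list, related terms sorted by descending importance)."""
    # Ties are broken alphabetically so the display order is deterministic.
    ranked = sorted(related_scores.items(), key=lambda item: (-item[1], item[0]))
    return list(extracted_terms), ranked


def register(index_file, doc_id, selected_terms):
    """Store the chosen index terms for one document (hypothetical index file)."""
    index_file[doc_id] = sorted(set(selected_terms))
    return index_file


# Sample data, invented for illustration only.
extracted, ranked = build_selection_lists(
    ["thesaurus", "indexing"],
    {"information retrieval": 0.7, "controlled vocabulary": 0.4, "classification": 0.7},
)
for term, score in ranked:
    print(f"{score:.2f}  {term}")

# Simulate a user who keeps all extracted terms and the two top related terms.
index_file = register({}, "doc-001", extracted + [term for term, _ in ranked[:2]])
print(index_file)
```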

Effects of the Invention

As described above, the present invention provides a thesaurus that holds relation information between index terms and a related index term search unit that retrieves related index terms for the index terms extracted from a document. When index terms are selected in the result display/selection unit, the related index terms are therefore displayed and offered for selection together with the extracted index terms, so that even a term that does not appear in the document can, where related, be assigned to the document as an index term and entered in the database. The user can thus express the concept of the document more appropriately, the choice of index terms available when searching for documents widens, and retrieval recall improves. In particular, according to the invention of claim 2, the importance of the related index terms is calculated with reference to the thesaurus, a list of the related index terms in order of relevance is created, and this list is displayed together with the list of extracted index terms and offered for selection. Related index terms of high importance can therefore be registered without omission and more appropriate index terms can be assigned, so that documents that the conventional method would have missed because of incomplete search conditions also become retrievable.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention, and FIG. 2 is a block diagram showing a conventional example.

Claims

[Claims]

1. An indexing support device comprising: a thesaurus that holds index terms and relation information between the index terms; a related index term search unit that retrieves from the thesaurus related index terms that are related to extracted index terms extracted from a document to be registered in a database; and a result display/selection unit, wherein the extracted index terms and the retrieved related index terms are displayed on the result display/selection unit, and the index terms selected by the result display/selection unit are stored in the database together with the document to be registered.

2. An indexing support device comprising: a thesaurus that holds index terms and relation information between the index terms; a related index term search unit that retrieves from the thesaurus related index terms that are related to extracted index terms extracted from a document to be registered in a database, calculates the importance of each related index term with respect to the document by a predetermined formula with reference to the thesaurus, and creates a list of the related index terms in order of relevance; and a result display/selection unit, wherein the list of extracted index terms and the created relevance-ordered list of related index terms are displayed on the result display/selection unit, and the index terms selected by the result display/selection unit are stored in the database together with the document to be registered.