JPH0619970A - Key word extracting system

Info

Publication number: JPH0619970A
Application number: JP4173941A
Authority: JP
Inventors: Yukiko Horie; 由記子堀江
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1992-07-01
Filing date: 1992-07-01
Publication date: 1994-01-28

Abstract

PURPOSE: To extract keywords fairly and efficiently. CONSTITUTION: An input means 1 uses OCR or the like to convert the printed text portion, excluding figures, tables, and the like, into a readable form. A data extraction part 12 reads data from the input means 1 and outputs keyword candidates using a dictionary provided beforehand and predetermined kinds of stop words. A data maintenance part 13 searches a data holding part 5 for each extracted keyword candidate, newly stores the keyword candidate in the data holding part 5 when it is not present there, and increases the keyword candidate's occurrence counter by 1 when it is present. An output part 14 outputs as keywords those keyword candidates whose occurrence counter value is equal to or greater than a predetermined value.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a keyword extraction method using full-text search in an information retrieval system.

[0002]

[Prior Art] In conventional information retrieval systems, keyword extraction has been performed individually by the document's author or by a third-party expert.

[0003]

[Problems to Be Solved by the Invention] The conventional keyword extraction method described above relies on the subjective judgment of the document's author or of a third-party expert. While this captures the important points that the person most wants to state, other related matters tend to be overlooked.

[0004] In addition, when the person lacks specialized knowledge of the target document, it is difficult to understand the content completely, so keywords that should be extracted may be overlooked or considerable time may be spent.

[0005]

[Means for Solving the Problems] The first invention is a keyword extraction system in an information retrieval system, characterized by comprising: input means that uses OCR or the like to convert the printed text portion, excluding figures, tables, and the like, into a readable format; a data extraction unit that reads data from the input means and outputs keyword candidates using a dictionary provided in advance and stop words of predetermined kinds; a data maintenance unit that searches a data holding unit provided in advance for each extracted keyword candidate, newly stores the keyword candidate in the data holding unit if it is not present there, and increments the keyword candidate's occurrence counter by 1 if it is present; and output means that outputs as keywords those keyword candidates whose occurrence counter value is equal to or greater than a predetermined value.

[0006]

[Embodiment] Next, an embodiment of the present invention will be described with reference to the drawings.

[0007] FIG. 1 is a block diagram showing an embodiment of the present invention, and FIG. 2 is a diagram showing the flow of operation of the data holding unit 5 in this embodiment.

[0008] The input means 1 has a function of using OCR or the like to convert the printed text portion, excluding figures, tables, and the like, into a readable format, and a function of updating the stop words stored in the input data decoding unit 3.

[0009] The data extraction unit 12 reads data from the input means 1 by way of the read instruction unit 2 and, while the input data decoding unit 3 recognizes words using its dictionary function, continues reading until a stop word such as a particle, an article, or a punctuation mark is detected.
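As an illustration only, the reading operation described above might be sketched in Python as follows. The function name `extract_candidates`, the sample text, and the rule that every stop word ends the current span are assumptions made for this sketch; the specification does not state how the input data decoding unit 3 resolves a stop-word character that occurs inside a longer word.

```python
def extract_candidates(text, dictionary, stop_words):
    """Return dictionary words found immediately before a stop word.

    Hypothetical sketch: characters are read one at a time, and each
    detected stop word is treated as a hard delimiter (an assumption).
    """
    stops = sorted(stop_words, key=len, reverse=True)  # try longer stop words first
    candidates = []
    buffer = ""
    for ch in text:                        # read one character at a time
        buffer += ch
        for stop in stops:
            if buffer.endswith(stop):      # a stop word has been detected
                word = buffer[: -len(stop)]
                if word in dictionary:     # the dictionary recognizes the word
                    candidates.append(word)
                buffer = ""                # assumption: start a fresh span
                break
    return candidates


# Hypothetical sample input, dictionary, and stop words.
sample = "まいごのまいごのねこ。"
print(extract_candidates(sample, {"まいご", "ねこ"}, {"の", "。"}))
# → ['まいご', 'まいご', 'ねこ']
```

A practical system would likely need a morphological analyzer in place of the simple suffix check, since treating stop words as hard delimiters splits any word that contains a stop-word character.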

[0010] The data maintenance unit 13 uses the comparison operation unit 4 to compare each extracted word with the data stored in the data holding unit 5. If the word is not present, it is newly stored in the data holding unit 5; if it is present, its occurrence counter is incremented by 1.
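The store-or-increment rule above can be sketched as a small Python function. This is a minimal sketch that assumes the data holding unit 5 can be modeled as a dictionary mapping each word to its occurrence counter; the name `record_candidate` is hypothetical.

```python
def record_candidate(store, word):
    """Store a new word with a count of 1, or add 1 to an existing count.

    `store` stands in for the data holding unit (an assumption): a dict
    mapping each word to its occurrence counter.
    """
    if word in store:        # comparison against the stored data
        store[word] += 1     # already present: increment the counter by 1
    else:
        store[word] = 1      # not present: newly store with a count of 1
    return store[word]


store = {}
record_candidate(store, "maigo")   # first occurrence is stored
record_candidate(store, "maigo")   # second occurrence increments the counter
print(store)                       # → {'maigo': 2}
```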

[0011] The output unit 14 uses the registration unit 9 to classify (weight) the extracted data into levels according to frequency of occurrence, selects the items that qualify as keywords, and outputs them to the output means 10. The control unit 6 executes the program stored in the memory 8 via the memory control unit 7 and controls the entire system.
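The specification does not say how frequencies are mapped to levels. One possible reading, given purely as a sketch, is to compare each count against a sorted list of boundary values; the boundaries `[2, 5]`, the minimum level, and the function names below are all hypothetical.

```python
from bisect import bisect_right


def assign_level(count, boundaries):
    """Map an occurrence count to a level (weight).

    `boundaries` is a sorted list of hypothetical cut-off values; a count
    equal to a boundary falls into the higher level.
    """
    return bisect_right(boundaries, count)


def select_keywords(counts, boundaries, min_level):
    """Return the words whose level is at least `min_level`."""
    return [w for w, c in counts.items()
            if assign_level(c, boundaries) >= min_level]


counts = {"maigo": 6, "neko": 3, "inu": 1}    # hypothetical counter values
print(select_keywords(counts, [2, 5], 1))     # → ['maigo', 'neko']
```

With boundaries `[2, 5]`, counts of 0 to 1 fall in level 0, counts of 2 to 4 in level 1, and counts of 5 or more in level 2.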

[0012] Next, the operation of this embodiment will be described with reference to FIG. 2.

[0013] The read instruction unit 2 reads one character at a time, starting from the sentence-initial character "ま" (ma). When it has read as far as "まいごの" (maigo no), the input data decoding unit 3 detects the word "まいご" (maigo) and the stop word "の" (no), and the comparison operation unit 4 checks whether the word "maigo" is already stored in the data holding unit 5. If it is already registered, its counter is incremented by one; if not, the word is stored in the data holding unit 5. In this example, assuming that the word "maigo" has not yet been stored, it is stored in the data holding unit 5 with a counter value of 1 (step 1).

[0014] The above operation is repeated, and when the end of the document is reached (step 4), the registration unit 9 designates as keywords the words whose weight is equal to or greater than a predetermined value and outputs them to the output means 10.
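The overall flow, counting every candidate until the end of the document and then applying the predetermined value, might look like the following sketch. It assumes that the weight is simply the raw occurrence count and that candidates arrive as an already extracted sequence; the name `extract_keywords` and the sample data are hypothetical.

```python
from collections import Counter


def extract_keywords(candidates, threshold):
    """Count each candidate, then keep those seen at least `threshold` times.

    Assumption: the weight of a word equals its raw occurrence count.
    """
    counts = Counter()
    for word in candidates:    # repeat until the end of the document
        counts[word] += 1      # first sighting gives 1, later ones add 1
    # Output only the words whose count reaches the predetermined value.
    return [w for w, n in counts.items() if n >= threshold]


stream = ["maigo", "neko", "maigo", "inu", "maigo", "neko"]  # hypothetical
print(extract_keywords(stream, 2))   # → ['maigo', 'neko']
```

Because `Counter` preserves insertion order, the keywords are returned in the order in which they first appeared in the document.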

[0015]

[Effects of the Invention] As described above, the present invention allows keywords to be extracted automatically according to a fixed criterion, with the effect that keyword extraction is performed fairly and efficiently.

[Brief description of drawings]

[FIG. 1] A block diagram showing an embodiment of the present invention.

[FIG. 2] A diagram showing the flow of operation of the data holding unit in this embodiment.

[Explanation of symbols]

1 Input means
2 Read instruction unit
3 Input data decoding unit
4 Comparison operation unit
5 Data holding unit
6 Control unit
7 Memory control unit
8 Memory
9 Registration unit
10 Output means
12 Data extraction unit
13 Data maintenance unit
14 Output unit

Claims

[Claims]

1. A keyword extraction system in an information retrieval system, characterized by comprising: input means that uses OCR or the like to convert the printed text portion, excluding figures, tables, and the like, into a readable format; a data extraction unit that reads data from the input means and outputs keyword candidates using a dictionary provided in advance and stop words of predetermined kinds; a data maintenance unit that searches a data holding unit provided in advance for each extracted keyword candidate, newly stores the keyword candidate in the data holding unit if it is not present there, and increments the keyword candidate's occurrence counter by 1 if it is present; and output means that outputs as keywords those keyword candidates whose occurrence counter value is equal to or greater than a predetermined value.