JPH09190450A

JPH09190450A - Information processor and its method

Info

Publication number: JPH09190450A
Application number: JP8018071A
Authority: JP
Inventors: Shogo Shibata; 昇吾柴田; Makoto Hirota; 誠廣田; Shiro Ito; 史朗伊藤; Takanari Ueda; 隆也上田; Yuji Ikeda; 裕治池田; Minoru Fujita; 稔藤田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1996-01-09
Filing date: 1996-01-09
Publication date: 1997-07-22

Abstract

PROBLEM TO BE SOLVED: To provide an information processor that can pick up keywords corresponding to personal interest while also picking up the general keywords in a text. SOLUTION: In a document processor, a document holding part 1 stores an inputted document and a sentence segmenting part 2 segments sentences from the document. A word segmenting part 3 segments words from each segmented sentence. A user keyword holding part 4 stores user keywords for each person. A user keyword pick-up part 5 picks up a segmented word as a keyword candidate when it is a user keyword. An unrequired word list holding part 6 stores unrequired words, that is, words such as 'こと' (matter) and 'もの' (thing) that generally do not become keywords. A keyword candidate pick-up part 7 confirms that a word segmented by the word segmenting part 3 is not an unrequired word and picks it up as a keyword candidate. A keyword candidate holding part 8 holds the keyword candidates picked up by the user keyword pick-up part 5 and the keyword candidate pick-up part 7. A keyword selecting part 9 narrows the keyword candidates down to an appropriate number of keywords, and the result is provided to the user by an output part 10.

Description

【発明の詳細な説明】Detailed Description of the Invention

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to an information processing apparatus that extracts, from a document, keywords representing its text.

【０００２】[0002]

[Prior Art] Methods of extracting keywords from text on workstations and personal computers have been studied. Keywords are useful as an index for retrieval, but assigning them is an intellectual task that cannot be done without understanding the text, so assigning keywords accurately is difficult even for humans.

[0003] In a method in which a computer extracts keywords automatically, nouns are first cut out of the text and the frequency of each is counted. Next, words that occur frequently in general (words that cannot serve as keywords) are removed from the cut-out nouns. Finally, whether to adopt each remaining noun as a keyword is decided in consideration of factors such as where it appears in the text.

【０００４】[0004]

[Problems to Be Solved by the Invention] However, the conventional example described above has the following problem, and an improvement is desired. Because the keyword extraction method described above uses only the information written in the text, it is aimed at retrieving the keywords the writer had in mind. Text is generally written for unspecified readers, so this extraction method can be expected to extract only general keywords of the kind a person without background knowledge would use. Each person has different background knowledge, and interests differ from person to person. Consequently, a word that is a keyword for one person may well not be a keyword for another.

[0005] For example, a person investigating market trends cannot afford to overlook the name of a company of interest when reading a newspaper. In an article where the company name appears many times, the name becomes a general keyword, but in an article where it appears only once in passing, it is unlikely to become one. In the latter case a human reader is also likely to overlook it, so it is precisely in such cases that the computer should extract it as a keyword.

[0006] Accordingly, an object of the present invention is to provide an information processing apparatus that can extract the general keywords in a text while also extracting keywords that match an individual's interests.

【０００７】[0007]

[Means for Solving the Problems] To achieve the above object, the information processing apparatus according to claim 1 of the present invention comprises: sentence cutting-out means for cutting out sentences from an input document; word cutting-out means for cutting out words from the cut-out sentences; user keyword holding means for holding words registered by a user as user keywords; user keyword extracting means for comparing the held user keywords with the cut-out words and extracting words that match a user keyword as keyword candidates; unnecessary word list holding means for holding an unnecessary word list in which words unsuitable as keywords are registered; keyword candidate extracting means for comparing the words registered in the unnecessary word list with the cut-out words and, when there is no match, extracting the cut-out word as a keyword candidate; keyword candidate holding means for holding the extracted keyword candidates; and keyword selecting means for selecting the keywords from among the held keyword candidates.

[0008] The information processing apparatus according to claim 2 is the information processing apparatus according to claim 1, further comprising score assigning means for assigning each extracted keyword candidate a score according to its importance, wherein the keyword selecting means selects the keywords according to the assigned scores.

[0009] The information processing apparatus according to claim 3 is the information processing apparatus according to claim 2, wherein types are provided for the user keywords, a keyword candidate that matches a user keyword of a specific type is set as a most important keyword without being assigned a score by the score assigning means, and the keyword selecting means selects the most important keyword as a keyword.

[0010] The information processing method according to claim 4 comprises: cutting out sentences from an input document; cutting out words from the cut-out sentences; holding words registered by a user as user keywords; comparing the held user keywords with the cut-out words and extracting words that match a user keyword as keyword candidates; holding an unnecessary word list in which words unsuitable as keywords are registered; comparing the words registered in the unnecessary word list with the cut-out words and, when there is no match, extracting the cut-out word as a keyword candidate; holding the extracted keyword candidates; and selecting the keywords from among the held keyword candidates.

【００１１】[0011]

[Embodiments of the Invention] An embodiment of the information processing apparatus of the present invention will be described. The information processing apparatus in this embodiment is applied to a document processing apparatus. FIG. 1 is a block diagram showing the functional configuration of the document processing apparatus in the embodiment. In the figure, 1 is a document holding unit that stores documents that have come in to the user, 2 is a sentence cutout unit that cuts out sentences from an input document, and 3 is a word cutout unit that cuts out words from the cut-out sentences.

[0012] 4 is a user keyword holding unit that stores, as user keywords, the words that should be keywords for each individual. 5 is a user keyword extraction unit that determines whether a word cut out by the word cutout unit 3 is a user keyword in the user keyword holding unit 4 and, if it is, extracts it as a keyword candidate.

[0013] 6 is an unnecessary word list holding unit that stores a list of unnecessary words, that is, words such as 「こと」 ("matter") and 「もの」 ("thing") that appear frequently in any document and generally do not become keywords. 7 is a keyword candidate extraction unit that confirms that a word cut out by the word cutout unit 3 is not in the unnecessary word list holding unit 6 and extracts it as a keyword candidate.

[0014] 8 is a keyword candidate holding unit that holds the keyword candidates extracted by the user keyword extraction unit 5 and the keyword candidate extraction unit 7. 9 is a keyword selection unit that narrows these keyword candidates down to an appropriate number of keywords, and the result is provided to the user by the output unit 10.

[0015] FIG. 2 is a block diagram showing the hardware configuration of the document processing apparatus. In the figure, 11 is a control memory that stores control procedures and consists of ROM and RAM. 12 is a central processing unit that performs processing according to the control procedures stored in the control memory 11, and it has the document holding unit 1, the user keyword holding unit 4, and the keyword candidate holding unit 8.

[0016] 14 is a keyboard. 15 is a disk, which has the unnecessary word list holding unit 6. 16 is a display consisting of a CRT or a liquid crystal display (LCD), and it is used to display the keywords. 17 is a bus that connects the components.

[0017] FIG. 3 is a flowchart showing the keyword extraction procedure of the document processing apparatus. FIG. 5 is an explanatory diagram showing a document to be processed and its processing result. First, sentences are cut out one at a time from the document retrieved from the document holding unit 1 (step S1), and words are cut out from each cut-out sentence (step S2). In the example text, words such as "20th" (２０日), "ABC", and "president" are cut out. When a word is an inflected form, such as a verb, it is restored to its base form (step S3).
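Steps S1 and S3 can be sketched roughly as follows. This is only an illustration under stated assumptions: the patent does not say how sentences or words are cut out, so splitting on the Japanese full stop 「。」 and the lookup table `base_forms` are hypothetical stand-ins, and the sample sentences are made up. Word cutting (step S2) is omitted because Japanese text has no spaces between words and would normally need a morphological analyzer.

```python
import re


def split_sentences(text):
    """Step S1 (sketch): cut out sentences, keeping the closing full stop."""
    return [s.strip() for s in re.findall(r"[^。]+。?", text) if s.strip()]


def to_base_form(word, base_forms):
    """Step S3 (sketch): restore an inflected word to its base form.

    `base_forms` is a hypothetical lookup table; unknown words pass through.
    """
    return base_forms.get(word, word)


sentences = split_sentences("今日は晴れです。明日は雨です。")
```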

[0018] It is determined whether each word is registered in the user keyword holding unit 4 (step S4). FIG. 4 is an explanatory diagram showing the information stored in the user keyword holding unit 4. User keywords come in two types: "most important", which is always extracted, and "important", which is weighted more heavily than other words. Here, "Canon" is set as most important, and "color copy" and "camera" are set as important.
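The registered information described for FIG. 4 could be held in a simple mapping from word to type. The following is a minimal sketch; the dictionary layout and the names `KeywordType`, `USER_KEYWORDS`, and `lookup_user_keyword` are assumptions for illustration, not taken from the patent.

```python
from enum import Enum


class KeywordType(Enum):
    MOST_IMPORTANT = "most important"  # always extracted
    IMPORTANT = "important"            # weighted above ordinary words


# Entries follow the example in the text; the storage format is assumed.
USER_KEYWORDS = {
    "Canon": KeywordType.MOST_IMPORTANT,
    "color copy": KeywordType.IMPORTANT,
    "camera": KeywordType.IMPORTANT,
}


def lookup_user_keyword(word):
    """Step S4 (sketch): return the registered type, or None if unregistered."""
    return USER_KEYWORDS.get(word)
```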

[0019] Next, the type is determined (step S5). If the type is "most important", a most important keyword is created (step S6). If it is "important", a user keyword score is assigned (step S7). This score is simply twice the score that would be assigned in step S9, described later. The scored word is registered in the keyword candidate list of the keyword candidate holding unit 8 (step S10). In the case of FIG. 5, the user keyword "Canon" is present in the second sentence, so it is confirmed as a keyword candidate.

[0020] For a word determined in step S4 not to be a user keyword, it is determined whether the word is an unnecessary word (step S8). The unnecessary word list in the unnecessary word list holding unit 6 is a simple list consisting of words alone, or of words together with information such as part of speech.

[0021] If the word is determined to be an unnecessary word, the process returns to step S2 and the next word is processed. If it is not an unnecessary word, it is scored as a keyword candidate (step S9) and registered in the keyword candidate list (step S10).
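The branching in steps S4 to S9 can be summarized as a single routing function. This is a sketch under assumptions: the type strings, the return labels, and the name `route_word` are invented for illustration, and the unnecessary word list is modeled as a plain set of words.

```python
MOST_IMPORTANT = "most important"
IMPORTANT = "important"


def route_word(word, user_keywords, unnecessary_words):
    """Decide how one cut-out word is handled (steps S4 to S9, sketch)."""
    kind = user_keywords.get(word)        # step S4: registered as a user keyword?
    if kind == MOST_IMPORTANT:            # step S5: check the type
        return "most_important_keyword"   # step S6: no score is assigned
    if kind == IMPORTANT:
        return "candidate_double_score"   # step S7: twice the step S9 score
    if word in unnecessary_words:         # step S8: unnecessary word?
        return "skip"                     # go on to the next word
    return "candidate_normal_score"       # step S9: ordinary scoring


user_keywords = {"Canon": MOST_IMPORTANT, "color copy": IMPORTANT}
unnecessary_words = {"こと", "もの"}
```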

[0022] Scores may be assigned in various ways. For example, 3 points are given when the word is in the first or last sentence of the document, and 2 points when it is in the first sentence of a paragraph. In addition, for a noun, points are allotted according to its case particle: 3 points for the は/が (wa/ga) case, 2 points for the を (wo) case, and 1 point otherwise. "ABC" in the first sentence of FIG. 5 is in the opening sentence of the document and is in the は (wa) case, so it has 6 points.

[0023] "Color copy" is in the opening sentence and takes one of the other cases, so it first has 4 points; because it is also a user keyword, that score is doubled for a total of 8 points.
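The example scoring rule and the two worked scores above can be reproduced with a small function. This is a sketch, not the patent's implementation: the key names are invented, and giving 0 position points to a word that is in neither a first/last sentence nor a paragraph-opening sentence is an assumption, since the text does not state that case.

```python
# Position points from the example rule; "other": 0 is an assumption.
POSITION_POINTS = {"first_or_last_sentence": 3, "paragraph_head": 2, "other": 0}
# Case-particle points for nouns from the example rule.
CASE_POINTS = {"wa_ga": 3, "wo": 2, "other": 1}


def occurrence_score(position, case, is_important_user_keyword=False):
    """Score one occurrence of a word (steps S7 and S9, sketch)."""
    base = POSITION_POINTS[position] + CASE_POINTS[case]  # step S9 score
    # Step S7: an "important" user keyword gets twice the step S9 score.
    return base * 2 if is_important_user_keyword else base


abc = occurrence_score("first_or_last_sentence", "wa_ga")            # 3 + 3
color_copy = occurrence_score("first_or_last_sentence", "other",
                              is_important_user_keyword=True)        # (3 + 1) * 2
```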

[0024] As described above, all scored keyword candidates are stored in the keyword candidate holding unit 8. Because there are many keyword candidates, their number must be reduced before they are presented to the user. The number of keywords is given in advance by the user; in this case it is set to two because the text is short.

[0025] The keyword candidates are sorted by score (step S11), and the two with the highest scores are adopted as keywords. Here, "color copy" with 8 points and "ABC", which appears twice (6 points + 3 points = 9 points), are created as keywords (step S12).
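Steps S11 and S12 amount to summing the points of repeated words, sorting, and taking the top entries. The sketch below assumes candidates arrive as (word, points) pairs; the "president" entry and its score are made up to show a candidate that is dropped, and the function name is illustrative.

```python
from collections import defaultdict


def select_keywords(scored_occurrences, limit):
    """Sum points per word, sort by total, keep the top `limit` (sketch)."""
    totals = defaultdict(int)
    for word, points in scored_occurrences:
        totals[word] += points  # a word that appears twice accumulates points
    # Step S11: sort by score, highest first (ties keep first-seen order).
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]       # step S12: adopt the top candidates


occurrences = [("ABC", 6), ("color copy", 8), ("president", 4), ("ABC", 3)]
keywords = select_keywords(occurrences, limit=2)
```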

[0026] The most important keywords created in step S6 and the keywords created in step S12 are concatenated (step S13). In this example, the most important keyword is "Canon", the important keyword is "color copy", and the ordinary keyword is "ABC".

[0027] At this stage the keywords are in score order, so all the keywords are rearranged to make them easy to follow (step S14). In the case of FIG. 5, they are rearranged into the order in which they appear in the document. All the rearranged keywords are presented to the user (step S15), and the process ends.
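Steps S13 and S14 can be sketched as merging the two keyword groups and ordering them by first appearance. The word sequence used below is hypothetical: the text only states that "Canon" is in the second sentence, so the relative order of the other words is made up for illustration.

```python
def merge_and_order(most_important, selected, words_in_document_order):
    """Concatenate both keyword groups (S13) and order by appearance (S14)."""
    wanted = set(most_important) | set(selected)
    ordered = []
    for word in words_in_document_order:
        if word in wanted and word not in ordered:
            ordered.append(word)  # keep only the first appearance
    return ordered


# Hypothetical sequence of cut-out words, in document order.
document_words = ["20th", "ABC", "president", "color copy", "Canon", "ABC"]
result = merge_and_order(["Canon"], ["ABC", "color copy"], document_words)
```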

[0028] In the above embodiment, only the keywords are displayed when the keywords are output, but the entire body text may be displayed with the keyword portions highlighted. Highlighting methods include changing the font, enlarging the font size, underlining, and changing the character color. Conversely, the characters other than the keywords may be made smaller or shown in a lighter color so that they are less conspicuous.
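For a plain-text display, highlighting within the full body text might look like the sketch below. Wrapping each keyword in a marker character is a stand-in for the font, underline, or color changes the text mentions; the function name, the marker, and the sample sentence are all assumptions.

```python
import re


def highlight(text, keywords, marker="*"):
    """Wrap every keyword occurrence in `marker` (sketch)."""
    if not keywords:
        return text
    # Try longer keywords first so a short keyword cannot split a longer one.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = "|".join(re.escape(k) for k in ordered)
    return re.sub(pattern, lambda m: f"{marker}{m.group(0)}{marker}", text)


shown = highlight("ABC announced a color copy machine.", ["ABC", "color copy"])
```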

[0029] In the above embodiment, the keywords are rearranged into the order in which they appear in the document, but another ordering, such as score order, may be used.

[0030] In the above embodiment, user keywords are classified into two types, "most important" and "important", but they may be classified more finely. With a finer classification, it may be made possible to specify, for example, a score multiplier for each type. Furthermore, a fixed number of points may be added instead of applying a multiplier.
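The finer-grained variation could be expressed as a per-type rule holding a multiplier, a fixed bonus, or both. This is a sketch only: the type name "notable", the bonus of 3, and the names `TypeRule` and `adjusted_score` are hypothetical, while the multiplier of 2 for "important" follows the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeRule:
    multiplier: int = 1  # factor applied to the base score
    bonus: int = 0       # fixed number of points added afterwards


RULES = {
    "important": TypeRule(multiplier=2),  # as in the embodiment
    "notable": TypeRule(bonus=3),         # hypothetical additive type
}


def adjusted_score(base, kind):
    """Apply the rule for `kind`; unregistered types leave the score unchanged."""
    rule = RULES.get(kind, TypeRule())
    return base * rule.multiplier + rule.bonus
```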

[0031] In the above embodiment, the unit of a user keyword is a word, but partial matches involving compound nouns, affixes, and the like may also be taken into account. For example, if the counter suffix 「円」 (yen) is registered, price expressions in a document such as 「５９，８００円」 can also be detected reliably.
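Partial matching could be sketched as a substring test in addition to the exact comparison. Using plain substring containment is an assumption; the text does not specify how partial matches on compound nouns or affixes would be detected.

```python
def matches_user_keyword(word, user_keyword, allow_partial=True):
    """Exact match, or substring match when partial matching is enabled."""
    if word == user_keyword:
        return True
    return allow_partial and user_keyword in word


price_hit = matches_user_keyword("59,800円", "円")
```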

[0032] In the above embodiment, user keywords are keywords to be detected, but conversely, words to be ignored may also be made registrable so that they do not become general keywords.

[0033] In the above embodiment, keywords are extracted from Japanese documents, but documents in English or in other languages besides Japanese may also be processed.

[0034] Furthermore, the present invention may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device. Needless to say, the present invention can also be applied where it is achieved by supplying a program to a system or an apparatus. In that case, the system or apparatus can obtain the effects of the present invention by reading a storage medium that stores the program, represented by software, for achieving the present invention.

【００３５】[0035]

[Effects of the Invention] According to the information processing apparatus of claim 1 of the present invention, the sentence cutting-out means cuts out sentences from the input document; the word cutting-out means cuts out words from the cut-out sentences; the user keyword holding means holds words registered by the user as user keywords; the user keyword extracting means compares the held user keywords with the cut-out words and extracts words that match a user keyword as keyword candidates; the unnecessary word list holding means holds an unnecessary word list in which words unsuitable as keywords are registered; the keyword candidate extracting means compares the words registered in the unnecessary word list with the cut-out words and, when there is no match, extracts the cut-out word as a keyword candidate; the keyword candidate holding means holds the extracted keyword candidates; and the keyword selecting means selects the keywords from among the held keyword candidates. The apparatus can therefore extract the general keywords in a text while also extracting keywords that match an individual's interests. As a result, keywords that reflect individual interests can be extracted, and a reader can judge appropriately from the keywords whether the body text should be read. In addition, when documents are searched by keyword, the necessary documents can be retrieved without omission.

[0036] According to the information processing apparatus of claim 2, score assigning means is provided for assigning each extracted keyword candidate a score according to its importance, and the keyword selecting means selects the keywords according to the assigned scores. The criteria for selecting keywords that reflect individual interests can therefore be made explicit, and the selection criteria can also be changed according to the keywords selected.

[0037] According to the information processing apparatus of claim 3, types are provided for the user keywords, a keyword candidate that matches a user keyword of a specific type is set as a most important keyword without being assigned a score by the score assigning means, and the keyword selecting means selects the most important keyword as a keyword. Words of extremely high importance can therefore be extracted as keywords without fail, which prevents them from being overlooked.

[0038] According to the information processing method of claim 4, sentences are cut out from an input document; words are cut out from the cut-out sentences; words registered by a user are held as user keywords; the held user keywords are compared with the cut-out words and words that match a user keyword are extracted as keyword candidates; an unnecessary word list in which words unsuitable as keywords are registered is held; the words registered in the unnecessary word list are compared with the cut-out words and, when there is no match, the cut-out word is extracted as a keyword candidate; the extracted keyword candidates are held; and the keywords are selected from among the held keyword candidates. The method can therefore extract the general keywords in a text while also extracting keywords that match an individual's interests.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the functional configuration of the document processing apparatus in the embodiment.

FIG. 2 is a block diagram showing the hardware configuration of the document processing apparatus.

FIG. 3 is a flowchart showing the keyword extraction procedure of the document processing apparatus.

FIG. 4 is an explanatory diagram showing the information stored in the user keyword holding unit 4.

FIG. 5 is an explanatory diagram showing a document to be processed and its processing result.

[Explanation of symbols]

2 sentence cutout unit
3 word cutout unit
4 user keyword holding unit
5 user keyword extraction unit
6 unnecessary word list holding unit
7 keyword candidate extraction unit
8 keyword candidate holding unit
9 keyword selection unit

Continued from the front page

(72) Inventor: Takaya Ueda, c/o Canon Inc., 3-30-2 Shimomaruko, Ota-ku, Tokyo
(72) Inventor: Yuji Ikeda, c/o Canon Inc., 3-30-2 Shimomaruko, Ota-ku, Tokyo
(72) Inventor: Minoru Fujita, c/o Canon Inc., 3-30-2 Shimomaruko, Ota-ku, Tokyo

Claims

[Claims]

1. An information processing apparatus comprising: sentence cutting-out means for cutting out sentences from an input document; word cutting-out means for cutting out words from the cut-out sentences; user keyword holding means for holding words registered by a user as user keywords; user keyword extracting means for comparing the held user keywords with the cut-out words and extracting words that match a user keyword as keyword candidates; unnecessary word list holding means for holding an unnecessary word list in which words unsuitable as keywords are registered; keyword candidate extracting means for comparing the words registered in the unnecessary word list with the cut-out words and, when there is no match, extracting the cut-out word as a keyword candidate; keyword candidate holding means for holding the extracted keyword candidates; and keyword selecting means for selecting keywords from among the held keyword candidates.

2. The information processing apparatus according to claim 1, further comprising score assigning means for assigning each extracted keyword candidate a score according to its importance, wherein the keyword selecting means selects the keywords according to the assigned scores.

3. The information processing apparatus according to claim 2, wherein types are provided for the user keywords, a keyword candidate that matches a user keyword of a specific type is set as a most important keyword without being assigned a score by the score assigning means, and the keyword selecting means selects the most important keyword as a keyword.

4. An information processing method comprising: cutting out sentences from an input document; cutting out words from the cut-out sentences; holding words registered by a user as user keywords; comparing the held user keywords with the cut-out words and extracting words that match a user keyword as keyword candidates; holding an unnecessary word list in which words unsuitable as keywords are registered; comparing the words registered in the unnecessary word list with the cut-out words and, when there is no match, extracting the cut-out word as a keyword candidate; holding the extracted keyword candidates; and selecting keywords from among the held keyword candidates.