JP3460728B2

JP3460728B2 - Document search method

Info

Publication number: JP3460728B2
Application number: JP13407293A
Authority: JP
Inventors: 泰嗣小川; 礼子別所
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1992-08-14
Filing date: 1993-05-12
Publication date: 2003-10-27
Anticipated expiration: 2018-10-27
Also published as: JPH06208588A

Description

Detailed Description of the Invention

【０００１】[0001]

[Technical Field] The present invention relates to a document search method, and more particularly to a document search method that can treat a document as relevant even when the search term is not exactly the same as a word in the document. It is applicable to, for example, document management devices and image management devices.

【０００２】[0002]

[Prior Art] The following publications describe prior art related to the present invention. The "similar document retrieval apparatus" proposed in Japanese Unexamined Patent Publication No. H2-2458 makes it possible to retrieve a desired document even when the document has no keywords, by automatically extracting keywords through morphological analysis or the like. When a search term is input, documents with a high similarity to it are output. Even if no keywords have been assigned to a document in advance, independent words are extracted from the document, taken as keywords in descending order of frequency, and compared with the search term to judge similarity. However, a document is not treated as relevant unless it contains a word exactly the same as the search term. Moreover, a method that extracts independent words, takes them as keywords in descending order of frequency, and compares them with the search term simply treats more frequent words as more important, so it cannot perform an accurate search.

【０００３】 In addition, "Text-base retrieval method based on semantic attributes" (Hiroshi Matsuo and one other, Transactions of the Information Processing Society of Japan, Vol. 32, No. 9, Sep. 1991, pp. 1172-1179) handles similarity relations among diverse expressions by retrieving texts according to the semantic similarity between a search instruction sentence and the headline sentence of each text, based on the semantic attributes of words. That is, the target card is retrieved, on the basis of its headline sentence, from a database (DB) that stores a large number of cards with headline sentences. The headline sentence, not the whole document, is treated as the index, and partial matches between the search term and the headline sentence are also accepted. However, because only headline sentences are searched, the method has the drawback that the whole document cannot be made the search target.

【０００４】[0004]

[Purpose] The present invention has been made in view of the circumstances described above. Its object is to provide a document search method that can treat a document as relevant even when the search term is not exactly the same as a word in the document, that can perform an accurate search because scores are assigned to the keywords in a document according to the search term, and that searches the whole document rather than only its headline sentence.

【０００５】[0005]

[Constitution] To achieve the above object, the present invention is characterized by the following.
(1) It comprises: morphological analysis means for morphologically analyzing an input search term and dividing it into a word string with parts of speech assigned; importance setting means for setting an importance for each constituent word of the search term such that, for the constituent words of the word string obtained by the morphological analysis means that have been given a keyword feature, the importance is raised step by step from a base point going back toward the front of the word string, and, for the constituent words of the word string that have not been given a keyword feature, the importance is raised step by step, going back toward the front of the word string, from a value larger than the total of the importances set for the constituent words that have been given a keyword feature; match degree calculation means for calculating, when a constituent word of the word string forming a keyword assigned to a registered document matches a constituent word of the word string forming the search term, a match degree for each keyword from the importances set for the constituent words of the search term; document score calculation means for calculating a document score for each document from the match degrees of the keywords; and document output means for outputting information about the documents in order of document score.
(2) In (1), when the keyword feature is a compound-word base, the importance setting means makes the importance smaller than that of the constituent words in the word string that have not been given a keyword feature.
(3) In (1), when the keyword feature is a proper-noun constituent word, the importance setting means makes the importance smaller than that of the constituent words in the word string that have not been given a keyword feature.
(4) In (1), when the keyword feature is prefix modification, the importance setting means makes the importance smaller than that of the constituent words in the word string that have not been given a keyword feature.
(5) In (1), when the keyword feature is a place-name identifier, the importance setting means makes the importance smaller than that of the constituent words in the word string that have not been given a keyword feature.
(6) In (1), when the keyword feature is an era-name identifier, the importance setting means makes the importance smaller than that of the constituent words in the word string that have not been given a keyword feature.
(7) In (1), the match degree calculation means increases the match degree according to the number of constituent words that match in both the word order of the keyword and the word order of the search term.
(8) In (1), the match degree calculation means increases the match degree when the last constituent words of the keyword and the search term match.
(9) In (1), the match degree calculation means increases the match degree when the first constituent words of the keyword and the search term match.
(10) In (1), the document score calculation means calculates the document score by weighting the match degree of each keyword according to the position at which the keyword appears in the document.
(11) In (1), the document score calculation means calculates the document score by weighting the match degree of each keyword according to the word that follows the keyword.
The invention is described below on the basis of embodiments.

【０００６】 FIG. 1 is a block diagram for explaining one embodiment of the document search method according to the present invention. In the figure, 1 is search term input means, 2 is document score assigning means, 3 is document ranking means, 4 is document output means, and 5 is a document to which keywords have been assigned. First, the user inputs a search term. Next, the document score assigning means 2 assigns to each document a score that depends on the input search term. It is assumed here that the documents 5 have been prepared in advance, segmented into words and with keywords assigned. Next, the document ranking means 3 sorts the scored documents in descending order of score, and the document output means 4 outputs them.

【０００７】 FIG. 2 is a flowchart for explaining the operation of the document score assigning means in FIG. 1.
step 1: The search term is morphologically analyzed and a part of speech is assigned to each word.
step 2: Each of those words is given an importance according to the rules.
step 3: If a word of a keyword of a document matches a word of the search term, even partially, the importance previously assigned to that search-term word is given; the importances are summed for each keyword to calculate the keyword's match degree.
step 4: The match degrees are summed for each document to give the score of that document.

【０００８】 In FIG. 2, the "importance" is a value assigned to each individual word obtained by morphologically analyzing the search term. The "match degree" is the value obtained, when a keyword in a document matches the search term (or part of it), by giving the importance of the corresponding search-term words and summing them for each keyword. The "score" is the final value obtained when the match degrees are summed for each document.

【０００９】 FIG. 3 is a flowchart for explaining the rules for assigning importance to a search term. As noted above, the search term is assumed to have been morphologically analyzed and broken down into parts of speech. The first important point is that the pointer is placed at the end of the word string (step 1). In other words, processing proceeds from the last word back toward the front. Initially, n is set to the base point and sum is set to 0 (step 2). Next, it is determined whether the word has been given a keyword feature (step 3). Words are thus divided into those with a keyword feature and those without: the former are handled by the part of FIG. 3 above the broken line (called phase 1 here), and the latter by the part below the broken line (called phase 2 here). Keyword features are described later.

【００１０】 Phase 1, the processing of words that have a keyword feature, is described first. It is first determined whether the keyword feature is "prefix modification" (step 4). As described later, "prefix modification" marks a prefix that serves to modify the word that follows it. If the feature is not "prefix modification", n is set as the importance of the word (step 5). Then n is added to sum, and 1 is added to n (step 6). It is then determined whether the word is at the head of the word string (step 7); if it is not, the pointer moves back one word (step 8) and the same processing is repeated from step 3. Thus n and sum grow as processing moves toward the front of the word string. If the word is at the head, the processing of words with a keyword feature ends, and the pointer returns to the end of the string (step 11) to begin phase 2. For a word whose keyword feature was found to be "prefix modification" in step 4, the base point is set as the importance of the word (step 9) and the base point is added to sum (step 10).

【００１１】 Phase 2 follows. After the pointer has returned to the end of the string in step 11, 1 is added to the sum accumulated in phase 1 (step 12). Next, as in phase 1, the presence or absence of a keyword feature is checked (step 13). Words with a feature have already been processed in phase 1, so only words without a feature are handled here. For a word with a feature, it is checked whether the word is at the head of the word string (step 16), and if so processing ends. For a word found in step 13 to have no feature, sum is set as the importance of the word (step 14). Then sum is added once more to the running total sum, and 1 is further added (step 15). It is then determined whether the word is at the head of the word string (step 16); if it is not, the pointer moves back one word (step 17) and the same processing is repeated from step 12. Thus, in phase 2, sum keeps accumulating as processing moves toward the front of the word string. In summary, among words with a keyword feature, those nearer the front of the word string have a higher importance, and no matter how many words with a keyword feature are added together (however many are strung together), even a single word without a keyword feature has a higher importance.

【００１２】 The keyword features used in the above description are now explained. There are five kinds of keyword feature: compound-word base, proper-noun constituent word, prefix modification, place-name identifier, and era-name identifier. The parts of speech to which each feature can be assigned, and the characteristics and role of each feature, are summarized in Table 1 below.

【００１３】[0013]

【表１】 [Table 1]

【００１４】 Except for "prefix modification", these features mark words that are unlikely to serve as keywords when they appear alone, or that have little discriminating power. The word "装置" (device) by itself, for example, is not a distinctive word. The same applies to "place-name identifiers" and "era-name identifiers". "東京" (Tokyo) matches many expressions, such as "東京大学" (University of Tokyo), "東京〇〇会社" (Tokyo XX company), "東京〇〇学校" (Tokyo XX school), and "〇〇会社東京支店" (XX company, Tokyo branch), so "東京" alone matches a large number of words in documents. For this reason, the importance of words with these keyword features is raised by only one point at a time, even when they are located toward the front of the word string. Conversely, common nouns and proper nouns without a keyword feature are given a high importance through sum. "Prefix modification" differs somewhat from the other features. Prefixes are normally not regarded as keywords at all, but prefixes such as "新" (new) and "大" (large), which are considered to strongly modify the following word, are marked "prefix modification". These are given only the base point.

【００１５】 The above rules are now illustrated for the case where the following terms are used as search terms.
Example 1: 慶応大学医科学研究所 (Keio University Institute of Medical Science)
Morphological analysis breaks the term down into parts of speech.
(Morphological analysis result) 慶応大学医科学研究所 → 慶応／大学／医／科学／研究／所
An importance is assigned to each word according to the rules.

【００１６】[0016]

【表２】 [Table 2]

【００１７】 As shown, importance (points) is assigned by first giving the base point (2 points here) to the last word of the word string. For words with a keyword feature, one point is added successively for each preceding word. A word without a keyword feature (here "慶応") receives the total of all the importances assigned so far plus 1. This is because, even if there is a document containing the keyword "大学医科学研究所", a document containing the keyword "慶応" is regarded as more important.
Example 2: 新素材研究開発 (new material research and development)
Morphological analysis breaks the term down into parts of speech.
(Morphological analysis result) 新素材研究開発 → 新／素材／研究／開発
An importance is assigned to each word according to the rules.

【００１８】[0018]

【表３】 [Table 3]

【００１９】 This example illustrates the handling of a prefix and of a word without a keyword feature that is not at the head of the word string. A prefix with the keyword feature "prefix modification" is given the base point (2 points) so that it scores differently from a prefix without the feature. In Example 1 the word without a keyword feature was at the head of the word string, so the importances were calculated in order from the last word. In Example 2 the word without a keyword feature (here "素材") is in the middle of the word string, but the procedure is the same. Because this word should have the greatest importance, the importance of "素材" is the total of the importances of the other words plus 1.

【００２０】 This completes step 2 of FIG. 2: importances have been assigned to the search term. Next, these importances are used to give each document a score. As described for steps 3 and 4 of FIG. 2, whenever a word of a keyword of a document matches a word of the search term (even in a partial match), the importance assigned to that search-term word is given; the match degree of each keyword is obtained, and the match degrees are finally summed to give the score. The scoring method is explained using Example 2 above, that is, with "新素材研究開発" as the search term. The importance of each word of this search term is shown again below.

【００２１】[0021]

【表４】 [Table 4]

【００２２】 Next, suppose that a certain document contains the following keywords. The match degree of each keyword in the document is then calculated as follows.

【００２３】[0023]

【表５】 [Table 5]

【００２４】 Once the match degrees have been calculated, they are summed for each document. This total is the score of the document. For this document, for example, 13 + 11 + 10 = 34, so the score is 34 points. When scores have been assigned to all documents in this way, the document ranking means sorts them by score, and the document output means outputs the documents starting from the highest score.
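The summation and ranking described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, document identifiers, and keyword lists are hypothetical (Table 5 is not reproduced here), and the word importances 新 = 2, 素材 = 8, 研究 = 3, 開発 = 2 are derived from the rules described for Example 2 with a base point of 2.

```python
from typing import Dict, List, Tuple


def keyword_match_degree(importance: Dict[str, int], keyword_words: List[str]) -> int:
    """Sum the importance of every keyword word that also occurs in the search term."""
    return sum(importance.get(word, 0) for word in keyword_words)


def document_score(importance: Dict[str, int], keywords: List[List[str]]) -> int:
    """Sum the match degrees of all keywords assigned to one document."""
    return sum(keyword_match_degree(importance, kw) for kw in keywords)


def rank_documents(
    importance: Dict[str, int], documents: Dict[str, List[List[str]]]
) -> List[Tuple[str, int]]:
    """Return (document id, score) pairs sorted by descending score."""
    scored = [(doc_id, document_score(importance, kws)) for doc_id, kws in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Assumed importances for the search term 新/素材/研究/開発.
importance = {"新": 2, "素材": 8, "研究": 3, "開発": 2}

# Hypothetical documents; each keyword is already segmented into words.
documents = {
    "doc_a": [["新", "素材", "研究"], ["素材", "研究"], ["素材", "開発"]],
    "doc_b": [["研究", "開発"], ["技術", "開発"]],
}

ranking = rank_documents(importance, documents)
print(ranking)  # [('doc_a', 34), ('doc_b', 7)]
```

With these hypothetical keywords, doc_a scores 13 + 11 + 10 = 34, the same totals as in the text, and is ranked ahead of doc_b.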

【００２５】 FIG. 4 is a diagram for explaining another embodiment of the document search method according to the present invention. In the figure, 11 is search term input means, 12 is document score assigning means, 13 is document ranking means, and 14 is document output means. The search term input means 11 receives the user's search term. The document score assigning means 12 assigns a score that depends on the input search term to every registered document. Each registered document has been assigned keywords that are segmented into words in advance. The document ranking means 13 sorts the registered documents in descending order of document score. The document output means 14 outputs the search results to the user.

【００２６】 FIG. 5 is a block diagram of the document score assigning means in FIG. 4. In the figure, 21 is morphological analysis means, 22 is importance assigning means, 23 is match degree calculation means, 24 is document score calculation means, and 25 is a registered document. The morphological analysis means 21 morphologically analyzes the search term and assigns a part of speech to each word. In the importance assigning means 22, the importance is a value representing how important each word obtained by morphological analysis of the search term is; it is calculated for each word according to the rules described below. In the match degree calculation means 23, the match degree is a value representing the degree to which each keyword assigned to a registered document 25 matches the search term. In the document score calculation means 24, the document score is a value representing the degree to which a registered document matches the search term; it is calculated from the match degrees between the search term and each keyword assigned to the registered document.

【００２７】 Each means is described in detail below.
Importance assigning means
First, the method of calculating the importance assigned to each word obtained by morphological analysis of the search term is described. The importance is calculated according to the following rules.
The importance of the part-of-speech group 1 word closest to the end of the search term is the base point.
The importance of any other part-of-speech group 1 word is the importance of the nearest part-of-speech group 1 word behind it plus the increment.
The importance of a prefix with the keyword feature "prefix modification" is the base point.
The importance of a prefix without the keyword feature "prefix modification" is 0.
The importance of a part-of-speech group 2 word is the sum of (1) the total importance of the part-of-speech group 1 words, (2) the importance of the prefixes with prefix modification, and (3) the total importance of the part-of-speech group 2 words behind it, plus the increment.
The importance of any other word is 0.

【００２８】 Here, part-of-speech group 1 consists of (1) common nouns with the keyword feature "compound-word base", (2) proper nouns with the keyword feature "proper-noun constituent word", (3) proper nouns with the keyword feature "place-name identifier", and (4) proper nouns with the keyword feature "era-name identifier". Part-of-speech group 2 consists of (1) common nouns without a keyword feature, (2) proper nouns without a keyword feature, (3) numerals, (4) suffixes, and (5) unregistered words and the like. The keyword features are summarized in Table 6 below.

【００２９】[0029]

【表６】 [Table 6]

【００３０】 The processing flow for assigning importance under these rules is the one shown in FIG. 3 described above. Examples of importance assignment are shown in Tables 7 and 8 below. Here the base point is 2 points and the increment is 1 point.
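The importance rules above (group 1 words, group 2 words, prefixes, base point and increment) can be sketched in Python as two backward passes over the word string. This is a minimal sketch that follows the stated rules rather than the exact step numbering of FIG. 3; the names `Word`, `assign_importance`, and the category constants are hypothetical, and the category given to each word in the examples is an assumption, since Tables 7 and 8 are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional

GROUP1 = "group1"          # noun carrying a keyword feature (compound-word base, etc.)
GROUP2 = "group2"          # noun without a keyword feature, numeral, suffix, unregistered word
PREFIX_MOD = "prefix_mod"  # prefix with the "prefix modification" feature
OTHER = "other"            # anything else, including prefixes without the feature


@dataclass
class Word:
    surface: str
    kind: str


def assign_importance(words: List[Word], base: int = 2, increment: int = 1) -> List[int]:
    """Return one importance value per word, in word-string order."""
    importance = [0] * len(words)

    # Pass 1 (from the end): group 1 words start at the base point and gain
    # one increment per earlier group 1 word; modifying prefixes get the base point.
    previous_group1: Optional[int] = None
    for i in range(len(words) - 1, -1, -1):
        if words[i].kind == GROUP1:
            importance[i] = base if previous_group1 is None else previous_group1 + increment
            previous_group1 = importance[i]
        elif words[i].kind == PREFIX_MOD:
            importance[i] = base
    fixed_total = sum(importance)  # group 1 total plus modifying-prefix total

    # Pass 2 (from the end): each group 2 word outweighs everything assigned so far.
    behind_total = 0  # total importance of group 2 words behind the current one
    for i in range(len(words) - 1, -1, -1):
        if words[i].kind == GROUP2:
            importance[i] = fixed_total + behind_total + increment
            behind_total += importance[i]
    return importance


# Example 2: 新/素材/研究/開発 (categories assumed from the description).
example2 = [Word("新", PREFIX_MOD), Word("素材", GROUP2),
            Word("研究", GROUP1), Word("開発", GROUP1)]
print(assign_importance(example2))  # [2, 8, 3, 2]

# Example 1: 慶応/大学/医/科学/研究/所, assuming every word except 慶応 is group 1.
example1 = [Word("慶応", GROUP2)] + [Word(s, GROUP1) for s in ["大学", "医", "科学", "研究", "所"]]
print(assign_importance(example1))  # [21, 6, 5, 4, 3, 2]
```

Running pass 2 only after pass 1 has finished is what guarantees that a single group 2 word always outweighs the combined importance of all group 1 words and modifying prefixes.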

【００３１】[0031]

【表７】 [Table 7]

【００３２】[0032]

【表８】 [Table 8]

【００３３】 Match degree calculation means
Next, the method of calculating the match degree between a keyword and the search term from the importance of each word of the search term is described. In the embodiment of FIGS. 1 and 2 described above, the match degree was the sum of the importances of the search-term words that match words contained in the keyword. In this embodiment, by contrast, the match degree is the product of the importances of the search-term words that match words contained in the keyword. Table 9 below shows examples of the match degree calculation with "新素材研究開発", shown in Example 2 of importance assignment, as the search term and the keyword varied among "新素材研究" (new material research), "素材研究" (material research), and "研究素材" (research material).
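The product-based match degree can be illustrated with a short Python sketch. The function name is hypothetical, the importances 新 = 2, 素材 = 8, 研究 = 3, 開発 = 2 are assumed from the importance rules with a base point of 2 and an increment of 1, and returning 0 when no word matches is an assumption of this sketch.

```python
from math import prod
from typing import Dict, List


def product_match_degree(importance: Dict[str, int], keyword_words: List[str]) -> int:
    """Multiply the importances of the keyword words that occur in the search term."""
    matched = [importance[word] for word in keyword_words if word in importance]
    # Assumption: a keyword that shares no word with the search term scores 0.
    return prod(matched) if matched else 0


# Assumed importances for the search term 新/素材/研究/開発.
importance = {"新": 2, "素材": 8, "研究": 3, "開発": 2}

print(product_match_degree(importance, ["新", "素材", "研究"]))  # 48
print(product_match_degree(importance, ["素材", "研究"]))        # 24
print(product_match_degree(importance, ["研究", "素材"]))        # 24
```

Because multiplication ignores word order, "素材研究" and "研究素材" receive the same match degree under this calculation.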

【００３４】[0034]

【表９】 [Table 9]

【００３５】 In addition, the match degree calculation means makes the match degree larger when a word sequence contained in the keyword matches a word sequence contained in the search term. For this purpose an "adjacency point" is newly introduced, and the match degree is multiplied by the adjacency point each time a word sequence in the keyword matches a word sequence in the search term. Table 10 below again shows examples of the match degree calculation with "新素材研究開発" as the search term and the keyword varied. Here the adjacency point is 2 points.

【００３６】[0036]

【表１０】 [Table 10]

【００３７】 When the keyword is "新素材研究", the keyword matches the search term completely, and both the sequence "新" followed by "素材" and the sequence "素材" followed by "研究" match among the constituent words. The adjacency point (2 points) is therefore applied twice in the match degree calculation (underlined in Table 10). In Table 9, "素材研究" and "研究素材" had the same match degree, 24. In Table 10, however, the word order of the keyword and the search term is taken into account, so the match degree of "素材研究", whose word order partially matches that of the search term "新素材研究開発", is doubled to 48.
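The effect of the adjacency point can be sketched in Python as follows. This is one possible reading of the description, assuming that the adjacency point is applied once for every pair of neighboring keyword words that also appears, in the same order, as neighbors in the search term. The function name is hypothetical, the importances are assumed from the importance rules (base point 2, increment 1), and the sketch assumes that no word is repeated within the search term.

```python
from math import prod
from typing import List


def adjacency_match_degree(
    query_words: List[str],
    query_importance: List[int],
    keyword_words: List[str],
    adjacency_point: int = 2,
) -> int:
    """Product of matched importances, times the adjacency point per shared neighbor pair."""
    importance = dict(zip(query_words, query_importance))
    matched = [importance[word] for word in keyword_words if word in importance]
    if not matched:
        return 0  # assumption: no shared word means no match
    degree = prod(matched)
    query_pairs = set(zip(query_words, query_words[1:]))
    for pair in zip(keyword_words, keyword_words[1:]):
        if pair in query_pairs:
            degree *= adjacency_point
    return degree


query = ["新", "素材", "研究", "開発"]
weights = [2, 8, 3, 2]  # assumed importances of the search-term words

print(adjacency_match_degree(query, weights, ["新", "素材", "研究"]))  # 192
print(adjacency_match_degree(query, weights, ["素材", "研究"]))        # 48
print(adjacency_match_degree(query, weights, ["研究", "素材"]))        # 24
```

For "新素材研究" the adjacency point is applied twice (48 × 2 × 2 = 192), for "素材研究" once (24 × 2 = 48), and for "研究素材" not at all, which reproduces the 48 versus 24 distinction described in the text.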

【００３８】 In addition, the match degree obtained when the keyword and the search term match completely should not vary with the number of words in the search term. For this purpose a "normalization coefficient" is newly introduced so that the match degree equals the normalization coefficient when the keyword and the search term match completely. First, the score of the search term is calculated from the importances of its constituent words. The search-term score is the match degree obtained when the keyword is equal to the search term. For example, with the match degree calculation of Table 10, the search-term score of "新素材研究開発" is 2 × 2 × 8 × 2 × 3 × 2 × 2 = 768 (the word importances 2, 8, 3, and 2, with the adjacency point 2 applied three times). Normalization divides the match degree between the keyword and the search term by the search-term score and multiplies the result by the normalization coefficient. For example, with a normalization coefficient of 1000 points, the match degree when the search term and the keyword match is as shown in Table 11 below.

【００３９】[0039]

【表１１】 [Table 11]
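The normalization described in paragraph [0038] can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the invention: the function name `normalize`, its parameter names, and the raw values in the usage lines are hypothetical; only the normalization coefficient of 1000 comes from the text.

```python
def normalize(match_degree: float, query_score: float,
              coefficient: float = 1000.0) -> float:
    """Scale a raw degree of coincidence so that a complete match equals `coefficient`.

    `query_score` is the search word score, i.e. the degree of coincidence
    obtained when the keyword is identical to the search word.
    """
    if query_score <= 0:
        raise ValueError("query_score must be positive")
    return match_degree / query_score * coefficient


# A complete match normalizes to the coefficient whatever the raw score is.
print(normalize(768, 768))  # 1000.0
print(normalize(96, 96))    # 1000.0
# Illustrative partial match: raw degree 30 against a search word score of 120.
print(normalize(30, 120))   # 250.0
```

Dividing by the search word score removes the dependence on the number of words in the search word, which is the stated purpose of the coefficient.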

【００４０】[0040] Without normalization, the degree of coincidence for a complete match differs from one search word to another; with normalization, it is the same regardless of the search word. Table 12 below shows examples of the degree of coincidence calculated for the search word "new material research and development" as the keyword is varied.

【００４１】[0041]

【表１２】 [Table 12]

【００４２】[0042] Document score calculation means

Finally, the method of calculating the document score from the degrees of coincidence between the keywords and the search word is described. In the embodiment shown in FIGS. 1 and 2, the document score was the sum, over all keywords of a registered document, of the degree of coincidence between each keyword and the search word. This had the drawback that the document score grows as the number of keywords assigned to the registered document increases, so methods that depend less on the number of keywords are described here. In the first method, the document score is the average of the degrees of coincidence between the keywords of the registered document and the search word; that is, the sum of those degrees of coincidence divided by the number of keywords of the document. As an example, Table 13 below shows the case where the search word is "new material research and development" and the keywords assigned to the document are "new material research", "material research", "research material", and "Ricoh".

【００４３】[0043]

【表１３】 [Table 13]

【００４４】[0044] Since this document has four keywords, the sum of the degrees of coincidence is divided by 4. In the second method, the document score is the sum of the degrees of coincidence between the keywords of the registered document and the search word, divided by the number of keywords whose degree of coincidence is 1 or more.

【００４５】[0045]

【表１４】 [Table 14]

【００４６】[0046] Unlike the first method, the second method divides the sum of the degrees of coincidence by 3, because three keywords have a degree of coincidence of 1 or more. In the third method, the document score is the maximum of the degrees of coincidence between the keywords of the registered document and the search word.

【００４７】[0047]

【表１５】 [Table 15]
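The three document score calculations described in paragraphs [0042] to [0046] can be sketched as below. This is a hedged illustration: the function names and the sample degrees of coincidence are assumptions introduced here and are not the values of Tables 13 to 15.

```python
from typing import Sequence


def score_average(degrees: Sequence[float]) -> float:
    """First method: sum of degrees divided by the number of keywords."""
    return sum(degrees) / len(degrees) if degrees else 0.0


def score_average_matched(degrees: Sequence[float]) -> float:
    """Second method: sum of degrees divided by the number of keywords
    whose degree of coincidence is 1 or more."""
    matched_count = sum(1 for d in degrees if d >= 1)
    return sum(degrees) / matched_count if matched_count else 0.0


def score_max(degrees: Sequence[float]) -> float:
    """Third method: the largest degree of coincidence."""
    return max(degrees, default=0.0)


# Hypothetical document with four keywords, one of which does not match at all.
degrees = [240, 60, 30, 0]
print(score_average(degrees))          # 82.5
print(score_average_matched(degrees))  # 110.0
print(score_max(degrees))              # 240
```

With four keywords of which three match, the first method divides by 4 and the second by 3, mirroring the divisors given in the text.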

【００４８】[0048] Next, another embodiment is described. The importance assignment and the degree-of-coincidence calculation are the same as in the preceding embodiment, so their description is omitted; only the document score calculation is described below. Document score calculation means computing a document score from the degrees of coincidence between the search word and each keyword assigned to a registered document. The preceding embodiment proposed several calculation methods; the average method is used for the explanation below, although the method proposed here can also be applied to the maximum method and others. In the average method, the document score is the average of the degrees of coincidence between the keywords of the registered document and the search word. As an example, the case where the search word is "new material research and development" and the keywords assigned to the document are "new material research", "material research", "research material", and "Ricoh" is shown.

【００４９】[0049]

【表１６】 [Table 16]

【００５０】[0050] In this method of the present invention, the calculated document score varies with the position at which a keyword appears. Because the importance of a keyword generally depends on where it appears in a document, varying the document score with the appearance position is effective for obtaining search results that meet the user's needs. When a keyword appears in the title, the degree of coincidence obtained by the degree-of-coincidence calculation means (the original degree of coincidence) is multiplied by a title coefficient, and the resulting value (the weighted degree of coincidence) is used in the document score calculation. Suppose that, in the preceding example, the keywords appear at the positions shown in the following table. With a title coefficient of 2, the weighted degree of coincidence of "material research", which appears in the title, is 61 × 2 = 122. As a result, the document score also differs from its previous value.

【００５１】[0051]

【表１７】 [Table 17]

【００５２】[0052] Likewise, when a keyword appears in the first sentence of the first paragraph, in the second or a later sentence of the first paragraph, in the first sentence of the second or a later paragraph, or in the second or a later sentence of the second or a later paragraph, the degree of coincidence is multiplied by the corresponding coefficient and the weighted degree of coincidence is used in the document score calculation. Table 18 below shows the document score calculation for the preceding example with a coefficient of 1.5 for the first sentence of the first paragraph, 1.2 for the second and later sentences of the first paragraph, 1 for the first sentence of the second and later paragraphs, and 0.8 for the second and later sentences of the second and later paragraphs.

【００５３】[0053]

【表１８】 [Table 18]
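The position-dependent weighting of paragraphs [0050] and [0052] can be sketched as follows, combined with the average method. The coefficients are those stated in the text; the position labels, the function names, and the input format (pairs of original degree and position) are assumptions made for this illustration.

```python
from typing import Iterable, Tuple

# Coefficients stated in the text; the dictionary keys are hypothetical labels.
POSITION_COEFFICIENTS = {
    "title": 2.0,
    "first_paragraph_first_sentence": 1.5,
    "first_paragraph_later_sentence": 1.2,
    "later_paragraph_first_sentence": 1.0,
    "later_paragraph_later_sentence": 0.8,
}


def weighted_degree(original_degree: float, position: str) -> float:
    """Multiply the original degree of coincidence by the position coefficient."""
    return original_degree * POSITION_COEFFICIENTS[position]


def document_score(keywords: Iterable[Tuple[float, str]]) -> float:
    """Average method applied to the weighted degrees of coincidence."""
    weighted = [weighted_degree(degree, position) for degree, position in keywords]
    return sum(weighted) / len(weighted) if weighted else 0.0


# The example from the text: "material research" (original degree 61) in the title.
print(weighted_degree(61, "title"))  # 122.0
```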

【００５４】[0054] The calculated document score is also varied according to the word that follows a keyword. Because the importance of a keyword generally depends on the word that follows it, varying the document score with the following word is effective for obtaining search results that meet the user's needs. When the word following a keyword is the case particle "が" (ga), the degree of coincidence obtained by the degree-of-coincidence calculation means (the original degree of coincidence) is multiplied by the coefficient for "が", and the resulting value (the weighted degree of coincidence) is used in the document score calculation. Suppose that, in the preceding example, the words following the keywords are as shown in the following table. With a coefficient of 2 for "が", the weighted degree of coincidence of "new material research", which is followed by "が", is 249 × 2 = 498. As a result, the document score also differs from its previous value.

【００５５】[0055]

【表１９】 [Table 19]

【００５６】[0056] When the word following a keyword is the adverbial particle "は" (wa), the case particle "を" (wo), or anything other than the case particle "が", the adverbial particle "は", and the case particle "を" (other), the degree of coincidence is likewise multiplied by the corresponding coefficient, and the weighted degree of coincidence is used in the document score calculation. Table 20 below shows the document score calculation for the preceding example with a coefficient of 1.5 for "は", 1 for "を", and 0.5 for other.

【００５７】[0057]

【表２０】 [Table 20]
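The weighting by following word in paragraphs [0054] and [0056] amounts to a table lookup with a fallback for all other words. A minimal sketch, assuming the following word is already known for each keyword occurrence; the names used here are hypothetical, while the coefficients are those given in the text.

```python
# Coefficients stated in the text for the particles が, は and を.
FOLLOWING_WORD_COEFFICIENTS = {"が": 2.0, "は": 1.5, "を": 1.0}
OTHER_COEFFICIENT = 0.5  # any other following word


def weighted_degree(original_degree: float, following_word: str) -> float:
    """Multiply the original degree of coincidence by the coefficient of
    the word that follows the keyword in the registered document."""
    coefficient = FOLLOWING_WORD_COEFFICIENTS.get(following_word, OTHER_COEFFICIENT)
    return original_degree * coefficient


# The example from the text: "new material research" (249) followed by "が".
print(weighted_degree(249, "が"))  # 498.0
```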

【００５８】[0058] The measures introduced above may also be applied together to calculate the document score. For the preceding example, the document score is calculated as follows.

【００５９】[0059]

【表２１】 [Table 21]

【００６０】[0060] The document search method described above has the following features.
・A document can be retrieved even when the search word entered by the user only partially matches a keyword assigned to the document.
・At search time, the degree of matching (degree of coincidence) between the search word and each keyword is calculated.
To this end, the search is carried out in the following steps.
S1: The search word is divided into words by morphological analysis.
S2: An importance is set for each word.
S3: The degree of coincidence is calculated from the importance of the words common to the search word and the keyword.
This method, however, leaves several points to be improved.
(a) In the importance setting of S2, the search word had to be scanned from back to front twice, which makes the importance setting complicated.
(b) In the degree-of-coincidence calculation of S3, paragraphs [0022] to [0024] ignore the order of the words in the keyword and the search word, so keywords that differ only in word order receive the same degree of coincidence. For example, keywords made of the same constituent words in a different order, such as "material research" and "research material", could not be distinguished.
(c) As shown in paragraph [0035], introducing the adjacent point makes it possible to distinguish keywords with different word orders, but the degree of coincidence was calculated by multiplication. Because multiplication is generally slower than addition on a computer, this method slows down document search.

【００６１】[0061] In the embodiment described below, improvement point (a) is addressed by scanning the search word only once; point (b) by awarding bonus points according to the matching word order when the word order matches in the degree-of-coincidence calculation; and point (c) by not using multiplication in the degree-of-coincidence calculation. FIG. 6 is a configuration diagram for explaining still another embodiment of the document search method according to the present invention. In the figure, 31 is a document search means, 32 a search word input means, 33 a document score giving means, 34 a document sorting means, 35 a document output means, 36 an index word file, 37 a document file, and 38 a document registration means.

【００６２】[0062] The document registration means 38 stores a document entered by the user, together with the keywords assigned to it, in the document file and the index word file. Several keywords can be set for one registered document, and a keyword may be a compound word made up of several constituent words (for example, "document search" is a compound word made up of the two words "document" and "search"). The index word file 36 is structured so that the keyword or keywords of each registered document can be identified. The document search means 31 uses the index word file 36 to find documents that match the search word entered by the user and presents the result to the user. It consists of four means: the search word input means 32, the document score giving means 33, the document sorting means 34, and the document output means 35. The search word input means 32 accepts the user's search word. The document score giving means 33 calculates a score for every registered document according to the entered search word. The document sorting means 34 sorts the registered documents in descending order of document score. The document output means 35 outputs the search result to the user.

【００６３】[0063] FIG. 7 is a configuration diagram of the document score giving means in FIG. 6. In the figure, 41 is a morphological analysis means, 42 an importance setting means, 43 a document score calculation means, and 44 a degree-of-coincidence calculation means. The morphological analysis means 41 morphologically analyzes the search word, divides it into words, and determines the part of speech of each word. In the document search device of the present invention, a compound word made up of several words can be used as the search word entered by the user. In the importance setting means 42, the importance is a value, assigned to each word obtained by morphological analysis of the search word, that expresses how important the word is. The setting method is described in detail later. In the document score calculation means 43, the document score is a value expressing how well a registered document matches the search word. It is calculated from the degrees of coincidence between the search word and each keyword assigned to the registered document. Here, the degree of coincidence is a value expressing how well a keyword assigned to the registered document matches the search word. It is calculated from the importance of each word of the search word; the calculation method is described in detail later. The document score is calculated by the methods described above (paragraphs [0042] to [0055]).

【００６４】[0064] The importance setting means and the degree-of-coincidence calculation means are described below. The importance setting means is described first. When the importance is set, the search word entered by the user has already been divided into words by morphological analysis. A search word Q consisting of n (n > 0) words is written q_1 … q_n. For example, the search word "document search device" consists of the three words "document", "search", and "device", so that q_1 = document, q_2 = search, and q_3 = device. The importance of a word q contained in the search word is written w(q). In the present invention, the importance of a word is given as follows.
・The importance of the last word of the search word is the base point α.
・The importance of any other word is the base point plus the word's distance from the end multiplied by the position coefficient β.

【００６５】[0065] Written as a formula, this is:
w(q_i) = α + β × (n − i) … (1)
Unlike the earlier method, this method does not require the search word to be scanned twice; a single scan sets the importance of every constituent word of the search word. The importance setting is illustrated with an example. Let the search word be "new material fiber development" (新素材繊維開発). This search word is divided into the four words "new", "material", "fiber", and "development". With the parameters of the above formula set to α = 10 and β = 2, the importance of each word is as shown in Table 22 below.

【００６６】[0066]

【表２２】 [Table 22]
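Formula (1) can be evaluated in a single pass over the words, as the following sketch shows. The function name `importance` and the English tokens standing in for the four Japanese words are assumptions made for illustration; the parameter values α = 10 and β = 2 are those of the example in the text, and the printed values are computed directly from formula (1).

```python
def importance(n: int, i: int, alpha: int = 10, beta: int = 2) -> int:
    """Importance of the i-th word (1-based) of an n-word search word,
    w(q_i) = alpha + beta * (n - i)."""
    return alpha + beta * (n - i)


words = ["new", "material", "fiber", "development"]
# One left-to-right pass assigns an importance to every constituent word.
weights = {word: importance(len(words), i) for i, word in enumerate(words, start=1)}
print(weights)  # {'new': 16, 'material': 14, 'fiber': 12, 'development': 10}
```

The last word receives the base point α, and each step toward the front adds β.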

【００６７】[0067] In the method above, the importance of words near the beginning keeps rising as the number of constituent words of the search word increases. Consequently, even when different search words share the same first word, that word's importance is higher the more constituent words the search word has. This problem can be avoided by applying a bias that depends on the number of constituent words of the search word. That is, a constituent-word-count coefficient γ is introduced and the importance is set as:
w(q_i) = α + β × (n − i) + γ × n … (2)
In particular, with γ = −β, the importance of the first word always takes the same value regardless of the number of constituent words. For the search word "new material fiber development" used in the preceding example, with the parameters set to α = 12, β = 2, and γ = −2, the importance of each word is as shown in Table 23 below.

【００６８】[0068]

【表２３】 [Table 23]
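The claim that γ = −β makes the importance of the first word independent of the number of constituent words can be checked by substituting into formula (2). The short derivation below is added for illustration; it uses only the symbols defined in the text.

```latex
\begin{align*}
w(q_i) &= \alpha + \beta\,(n - i) + \gamma\,n \\
       &= \alpha + \beta\,(n - i) - \beta\,n && (\gamma = -\beta) \\
       &= \alpha - \beta\,i , \\
w(q_1) &= \alpha - \beta .
\end{align*}
```

The result no longer contains n. With α = 12 and β = 2, the first word therefore always has importance 10, and for the four-word example the importances computed from formula (2) are 10, 8, 6, and 4 from first to last.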

【００６９】[0069] In the method above, the importance was set by the same formula regardless of the nature of the word. Depending on their nature, however, some words are important as search words and others are not, and it is desirable to give high importance to the important ones. For example, prefixes play an auxiliary role and are therefore generally less important than nouns. The importance parameters (α, β, γ) are therefore varied according to the part of speech of the word. For example, let the parameters for nouns (general nouns, sahen nouns, and so on) be α[noun] = 12, β[noun] = 2, γ[noun] = −2, and the parameters for prefixes be α[prefix] = 4, β[prefix] = 0, γ[prefix] = 0. The importance of each word of the search word "new material fiber development" is then as shown in Table 24 below.

【００７０】[0070]

【表２４】 [Table 24]
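Part-of-speech-dependent parameters can be sketched by looking up (α, β, γ) per word before applying formula (2). This is an illustration under stated assumptions: the part-of-speech tags assigned to each token (in particular treating "new" as a prefix), the English tokens, and all names are hypothetical, while the parameter values are those given in paragraph [0069]. The printed values are computed from formula (2) with those parameters.

```python
from typing import NamedTuple


class Params(NamedTuple):
    alpha: int
    beta: int
    gamma: int


# Parameter values from the text, keyed by hypothetical part-of-speech labels.
PARAMS_BY_POS = {
    "noun": Params(alpha=12, beta=2, gamma=-2),
    "prefix": Params(alpha=4, beta=0, gamma=0),
}


def importance(n: int, i: int, pos: str) -> int:
    """w(q_i) = alpha + beta * (n - i) + gamma * n with parameters chosen by POS."""
    p = PARAMS_BY_POS[pos]
    return p.alpha + p.beta * (n - i) + p.gamma * n


# Assumed output of morphological analysis: (word, part of speech) pairs.
tagged = [("new", "prefix"), ("material", "noun"),
          ("fiber", "noun"), ("development", "noun")]
weights = {word: importance(len(tagged), i, pos)
           for i, (word, pos) in enumerate(tagged, start=1)}
print(weights)  # {'new': 4, 'material': 8, 'fiber': 6, 'development': 4}
```

Extending the lookup key from the part of speech alone to a (part of speech, keyword feature) pair would accommodate the finer-grained parameters introduced next.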

【００７１】[0071] In the method above, words with the same part of speech had their importance set by the same formula. Whether a word is important as a search term, however, cannot be determined by part of speech alone; it also depends on factors such as the nature of the documents the search system targets. The embodiment described earlier proposed keyword features as a way of describing grammatical and semantic characteristics of words at a finer grain than the part of speech. For example, in a search system for textile-related documents, nouns relating to fibers occur frequently in the documents and are therefore less important as search words than general nouns. The noun "fiber" is therefore given the keyword feature "compound word base" to distinguish it from other general nouns, and the importance parameters (α, β, γ) are varied according to the keyword feature as well as the part of speech. For example, the parameters for nouns are set according to the presence or absence of the keyword feature "compound word base": α[noun, with feature] = 12, β[noun, with feature] = 2, γ[noun, with feature] = −2, and α[noun, without feature] = 1, β[noun, without feature] = 1, γ[noun, without feature] = −1. With the prefix parameters unchanged, the importance of each word of the search word "new material fiber development" is as shown in Table 25 below.

【００７２】[0072]

【表２５】 [Table 25]

【００７３】[0073] Next, the degree-of-coincidence calculation is described. The degree-of-coincidence calculation determines how well one of the keywords assigned to a document matches the search word, using the importance values set for the constituent words of the search word. Basically, the degree of coincidence between a keyword and the search word is defined as the sum of the importance values set for the constituent words they have in common. For example, suppose the search word is "new material fiber development" and the importance values are set as in Table 25. The degree of coincidence is calculated here for the three keywords "new material", "new development", and "synthetic fiber".

【００７４】[0074]
1. Keyword: "new material" (constituent words "new" and "material"). The two words "new" and "material" are shared with the search word.
Degree of coincidence = w(new) + w(material) = 4 + 8 = 12
2. Keyword: "fiber material development" (constituent words "fiber", "material", and "development"). The three words "fiber", "material", and "development" are shared with the search word.
Degree of coincidence = w(fiber) + w(material) + w(development) = 3 + 8 + 4 = 15
3. Keyword: "synthetic fiber sales" (constituent words "synthetic", "fiber", and "sales"). Only "fiber" is shared with the search word.
Degree of coincidence = w(fiber) = 3
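The basic degree-of-coincidence calculation, a sum of the importance values of the shared constituent words, can be sketched as follows. The importance values are the ones used in the worked examples of paragraph [0074]; the English tokens and the function name are assumptions for illustration.

```python
from typing import Mapping, Sequence

# Importance of each constituent word of the search word, as used in the examples.
WEIGHTS = {"new": 4, "material": 8, "fiber": 3, "development": 4}


def match_degree(keyword_words: Sequence[str], weights: Mapping[str, int]) -> int:
    """Sum the importance of the keyword's words that also occur in the search word."""
    return sum(weights[word] for word in keyword_words if word in weights)


print(match_degree(["new", "material"], WEIGHTS))                   # 12
print(match_degree(["fiber", "material", "development"], WEIGHTS))  # 15
print(match_degree(["synthetic", "fiber", "sales"], WEIGHTS))       # 3
```

Only addition is used, and the order of the words has no effect on the result.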

【００７５】[0075] With the method above, when several words are common to the search word and the keyword, it is not possible to tell whether those common words appear in the same order or not. That is, for the search word "new material fiber development", the degree of coincidence is the same whether the keyword is "material fiber" or "fiber material". However, the order in which "material" and "fiber" appear in the search word matches that of "material fiber", so "material fiber" should have a higher degree of coincidence than "fiber material". Therefore, when several words are common to the search word and the keyword, bonus points are added if the order of those words (the word sequence) is the same in the search word and the keyword. The bonus (hereinafter called the "adjacent point") is proportional to the number of matching word sequences, and the adjacent point per word sequence is written δ. With δ = 3, the degrees of coincidence for the same search word and keywords as before are as follows.

【００７６】[0076]
1. Keyword: "new material". The word sequence "new", "material" is shared.
Degree of coincidence = w(new) + w(material) + δ = 4 + 8 + 3 = 15
2. Keyword: "fiber material development". Three words are shared, but no word sequence matches.
Degree of coincidence = w(fiber) + w(material) + w(development) = 3 + 8 + 4 = 15
With the method above, the case where the search word and the keyword match completely cannot be distinguished from the case where the search word is contained in the keyword. That is, for the search word "new material fiber development", the degree of coincidence is the same whether the keyword is "new material fiber development" or "new material fiber development center". To solve this problem, δ[head] is added as a bonus when the first words of the search word and the keyword match, and δ[tail] is added when their last words match. With δ[head] = δ[tail] = 2, the degrees of coincidence for the same search word and keywords as before are as follows.

【００７７】[0077]
1. Keyword: "new material". The word sequence "new", "material" is shared, and "new" is the first word of both the search word and the keyword.
Degree of coincidence = w(new) + w(material) + δ + δ[head] = 4 + 8 + 3 + 2 = 17
2. Keyword: "fiber material development". "development" is the last word of both the search word and the keyword.
Degree of coincidence = w(fiber) + w(material) + w(development) + δ[tail] = 3 + 8 + 4 + 2 = 17
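The addition-only calculation with the adjacent point δ and the head and tail bonuses can be sketched as below. Counting a "matching word sequence" as a pair of consecutive words that occurs in both the search word and the keyword is an interpretation assumed here, as are the function names and English tokens; the values δ = 3 and δ[head] = δ[tail] = 2 and the importance values are those of the examples in the text.

```python
from typing import Mapping, Sequence, Set, Tuple


def adjacent_pairs(words: Sequence[str]) -> Set[Tuple[str, str]]:
    """All pairs of consecutive words in a word sequence."""
    return set(zip(words, words[1:]))


def match_degree(query: Sequence[str], keyword: Sequence[str],
                 weights: Mapping[str, int],
                 delta: int = 3, delta_head: int = 2, delta_tail: int = 2) -> int:
    """Sum of shared-word importance plus order, head and tail bonuses (addition only)."""
    score = sum(weights[word] for word in keyword if word in weights)
    # Adjacent point: one bonus per consecutive word pair shared by both sequences.
    score += delta * len(adjacent_pairs(query) & adjacent_pairs(keyword))
    if query and keyword:
        if query[0] == keyword[0]:
            score += delta_head  # first words match
        if query[-1] == keyword[-1]:
            score += delta_tail  # last words match
    return score


QUERY = ["new", "material", "fiber", "development"]
WEIGHTS = {"new": 4, "material": 8, "fiber": 3, "development": 4}

print(match_degree(QUERY, ["new", "material"], WEIGHTS))  # 17
print(match_degree(QUERY, QUERY, WEIGHTS))                # 32
print(match_degree(QUERY, QUERY + ["center"], WEIGHTS))   # 30
```

The last two lines show the purpose of the tail bonus: a keyword identical to the search word scores higher than a keyword that merely contains it.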

【００７８】[0078]

【効果】[Effects] As is apparent from the above description, the present invention provides the following effects.
(1) The search word is morphologically analyzed, and the resulting words, separated by part of speech, are compared with the keywords stored in part-of-speech units in the documents, so that a document can be retrieved even when the search word and the words in the document do not match completely.
(2) By calculating the degree of coincidence between the search word and the keywords in each document, each document can be given a score that reflects the search word. In this scoring, the base point is assigned to the last word of the word string and the points are raised toward the front of the string, so words nearer the front receive higher points.
(3) In calculating the degree of coincidence between the search word and a document, using the keyword features compound word base, proper noun constituent word, place name identification word, and era name identification word prevents words that are unlikely to serve as keywords from receiving high scores when documents are scored.
(4) In calculating the degree of coincidence between the search word and a document, using the keyword feature prefix modification allows points to be given to prefixes that carry a special meaning.
(5) Taking word order into account allows the degree of coincidence to be calculated accurately.
(6) Because the importance of each constituent word of the search word is set according to its position, importance can be set appropriately and search accuracy improves. Because the search word needs to be scanned only once, search speed also improves. In addition, because the weighted degree of coincidence and the document score are calculated according to the position at which a keyword appears in the registered document or the word that follows it, the document score is more appropriate than before.
(7) Because the order of the constituent words (the word sequence) of the search word and the keyword is reflected in the degree of coincidence, the degree of coincidence is calculated appropriately and search accuracy improves. Because the degree of coincidence is calculated by addition only, search speed also improves.
By using the identification word or the era identification word , it is possible to avoid giving a high score to a word that is unlikely to be a keyword when giving a score to a document. (4) For the calculation of the degree of coincidence between the search word and the document, by using the prefix modification, which is one of the keyword features, it is possible to give a score to a prefix having a special meaning. (5) The degree of coincidence can be accurately calculated by taking the word arrangement order into consideration in calculating the degree of coincidence. (6) Since the importance of the word is set according to the position of the constituent word of the search word, the importance can be set accurately and the search accuracy is improved. In addition, the search speed is improved because the search word is scanned once. At this time, since the weighted coincidence degree and the document score are calculated based on the appearance position of the keyword in the registered document or the subsequent word, the document score is more accurate than the conventional one. (7) Since the order (word sequence) of the constituent words of the search word and the keyword is reflected in the degree of coincidence, the degree of coincidence calculation can be accurately performed and the search accuracy word is improved. Further, since the coincidence degree calculation is only a sum operation, the search speed is improved.

[Brief description of drawings]

【図１】本発明による文書検索方式の一実施例を説明
するための構成図である。FIG. 1 is a configuration diagram for explaining an embodiment of a document search system according to the present invention.

【図２】図１における文書得点付与手段の動作を説明
するためのフローチャートである。FIG. 2 is a flowchart for explaining the operation of the document score giving means in FIG.

FIG. 3 is a flowchart for explaining the rule for assigning importance to a search word according to the present invention.

FIG. 4 is a configuration diagram for explaining another embodiment of the document search method according to the present invention.

FIG. 5 is a configuration diagram of the document score giving means in FIG. 4.

FIG. 6 is a configuration diagram for explaining still another embodiment of the document search method according to the present invention.

FIG. 7 is a configuration diagram of the document score giving means in FIG. 6.

[Explanation of symbols]

1 ... Search word input means, 2 ... Document score giving means, 3 ... Document ranking means, 4 ... Document output means, 5 ... Documents to which keywords have been assigned.

Continuation of the front page
(56) References
JP-A-3-116377 (JP, A)
Satoshi Imago, "Document filing system using short unit keywords," Proceedings of the 43rd National Convention of IPSJ (4), 1991, pp. 4-215 to 4-216
Masako Mouji, "Introduction of features in Japanese morphological analysis," 43rd National Convention of IPSJ (3), 1991, pp. 3-111 to 3-112
Nakamura, Sugiyama, "On the Matching Technical Report on Partial Matching Search," Proceedings of the 1980 (Showa 55) IECE General National Convention, Japan, The Institute of Electronics and Communication Engineers of Japan, March 1980, p. 5-263
(58) Fields surveyed (Int.Cl.⁷, DB name) G06F 17/30; JICST file (JOIS)

Claims

(57) [Claims]

1. A document retrieval method comprising: morphological analysis means for morphologically analyzing an input search word and dividing it into a word string whose words have parts of speech; importance setting means for setting an importance for each constituent word of the search word by setting, for each constituent word of the word string obtained by the morphological analysis means to which a keyword feature is assigned, an importance whose points increase in order from a basic point going back toward the front of the word string, and by setting, for each constituent word of the word string to which no keyword feature is assigned, an importance whose points increase in order, tracing backward through the word string, from the total of the importances set for the constituent words to which the keyword feature is assigned; coincidence degree calculation means for calculating, when a constituent word of the word string constituting a keyword assigned to a registered document matches a constituent word of the word string constituting the search word, a degree of coincidence for each keyword from the importance set for that constituent word of the search word; document score calculation means for calculating a document score for each document from the degree of coincidence of each keyword; and document output means for outputting information about the documents in order of document score.

2. The document retrieval method according to claim 1, wherein, when the keyword feature is a compound word base, the importance setting means sets the importance lower than the importance of the constituent words in the word string to which no keyword feature is assigned.

3. The document retrieval method according to claim 1, wherein, when the keyword feature is a proper noun constituent word, the importance is set lower than the importance of the constituent words in the word string to which no keyword feature is assigned.

4. The document retrieval method according to claim 1, wherein, when the keyword feature is prefix modification, the importance setting means sets the importance lower than the importance of the constituent words in the word string to which no keyword feature is assigned.

5. The document retrieval method according to claim 1, wherein, when the keyword feature is a place name identification word, the importance setting means sets the importance lower than the importance of the constituent words in the word string to which no keyword feature is assigned.

6. The document retrieval method according to claim 1, wherein, when the keyword feature is an era identification word, the importance setting means sets the importance lower than the importance of the constituent words in the word string to which no keyword feature is assigned.

7. The document retrieval method according to claim 1, wherein the coincidence degree calculation means increases the degree of coincidence according to the number of constituent words for which the order of the constituent words of the keyword matches the order of the constituent words of the search word.

8. The document retrieval method according to claim 1, wherein the coincidence degree calculation means increases the degree of coincidence when the final constituent word of the keyword and the final constituent word of the search word coincide.

9. The document retrieval method according to claim 1, wherein the coincidence degree calculation means increases the degree of coincidence when the first constituent word of the keyword and the first constituent word of the search word coincide.

10. The document retrieval method according to claim 1, wherein the document score calculation means calculates the document score by weighting the degree of coincidence of each keyword according to the position at which the keyword appears in the document.

11. The document retrieval method according to claim 1, wherein the document score calculation means calculates the document score by weighting the degree of coincidence of each keyword according to the word that follows the keyword.
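
The scoring and output steps recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the search word is assumed to be already segmented and to carry importance values; the degree of coincidence is taken as the sum of the importances of the search-word constituents found in a keyword (the description states only that the calculation is a sum operation); and the document score is assumed to be the sum over the document's keywords. The names `coincidence_degree`, `document_score`, and `rank_documents` are hypothetical, and the order and position weightings of claims 7 to 11 are not modeled.

```python
from typing import Dict, List, Sequence, Tuple

Query = Sequence[Tuple[str, int]]  # (constituent word, importance)


def coincidence_degree(query: Query, keyword_words: Sequence[str]) -> int:
    """Sum the importance of each search-word constituent found in the keyword."""
    present = set(keyword_words)
    return sum(importance for word, importance in query if word in present)


def document_score(query: Query, keywords: Sequence[Sequence[str]]) -> int:
    """Assumed aggregation: add up the coincidence degrees of all keywords."""
    return sum(coincidence_degree(query, kw) for kw in keywords)


def rank_documents(
    query: Query, documents: Dict[str, List[List[str]]]
) -> List[Tuple[str, int]]:
    """Return (document id, score) pairs in descending order of score."""
    scored = [(doc_id, document_score(query, kws)) for doc_id, kws in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: each document holds keywords already split into words.
query = [("document", 3), ("search", 2), ("method", 1)]
documents = {
    "A": [["document", "search", "device"]],
    "B": [["image", "search"]],
    "C": [["printer"]],
}
ranking = rank_documents(query, documents)
# ranking == [("A", 5), ("B", 2), ("C", 0)]
```

Document B is still retrieved although its keyword "image search" is not identical to the search word, because the shared constituent "search" contributes its importance to the score.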