JP2715981B2 - Search result evaluation device

Info

Publication number: JP2715981B2
Application number: JP7098337A
Authority: JP
Inventors: 加奈子久保; 幹也谷
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1995-04-24
Filing date: 1995-04-24
Publication date: 1998-02-18
Anticipated expiration: 2013-02-18
Also published as: JPH08292960A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a search result evaluation device for evaluating search results in a database search system, in particular a document database search system. The device assigns appropriate index terms and their weights to each record in the database and outputs the search results in order of how well they match the searcher's request.

[0002]

[Prior Art] Conventionally, when searching a database, in particular a document database search system, the records in the database have been searched using a search expression or the like that represents the searcher's request. The records that exactly matched the search expression were then output as the search result without any ordering. The searcher therefore had to read through every record in the search result to find the records that best satisfied the request.

[0003] To solve this problem, methods have been proposed that use the weight of a search term within a single record to output the search results in order of relevance to the searcher.

[0004] For example, the invention described in Japanese Patent Application Laid-Open No. Hei 1-149127, "Information Retrieval Apparatus" (hereinafter, Reference 1), cites the occurrence frequency, that is, how many times a search term appears in a record, as one of the factors that determine the weight of the search term.

[0005] The inventions described in Japanese Patent Application Laid-Open No. Hei 6-89211, "Database Apparatus and Data Management Method" (hereinafter, Reference 2), and Japanese Patent Application Laid-Open No. Hei 4-262460, "Information Retrieval Apparatus" (hereinafter, Reference 3), also use the occurrence frequency within a record as the weight of a search term.

[0006] In the invention described in Japanese Patent Application Laid-Open No. Hei 4-281565, "Document Retrieval Apparatus" (hereinafter, Reference 4), the weight in a full-text search varies according to the part of the text in which the search term appears. For example, the weight is 10 if the term appears in the title and 2 if it appears in the preface. The invention described in Japanese Patent Application Laid-Open No. Hei 3-294963, "Document Retrieval Apparatus" (hereinafter, Reference 5), likewise assigns weights according to the part in which the search term appears.

[0007]

[Problems to Be Solved by the Invention] However, none of the methods described in References 1 to 5 takes the length of the records in the database into account. For example, even when a search term occurs exactly once in each, the term should be regarded as more important in a record of about 20 characters than in a record of 2,000 characters, yet the conventional methods do not reflect this difference in the weighting.

[0008] In addition, the methods described in References 1 to 3 in particular do not take into account the number of records in the database in which a word appears. A word that appears in many records, such as "information" in a database on information processing technology, does not characterize any particular record and is of low importance, yet under these methods the word "information" also receives a high weight.

[0009] Furthermore, the methods described in References 4 and 5 in particular vary the weight according to the item in which the search term appears. However, when searching a database whose items consist of, for example, a title, an abstract, and a body, they do not consider the difference between a search term that appears in the title of 100 records and a search term that appears in the abstract of 100 records. In general, because an abstract is longer than a title, it contains more words, so a word is likely to appear in the abstracts of more records in the database, and its importance is presumed to be correspondingly lower.

[0010] Thus, the conventional methods that calculate the relevance of a record from search term weights leave problems to be solved with respect to how the search terms are weighted.

[0011] An object of the present invention is to solve the above problems and to provide a search result evaluation device that determines the weight of a search term by taking into account the length of each item of each record in the database, the occurrence frequency of the search term in each item of each record, and the number of records in which the search term appears for each item, and that thereby calculates a record relevance more appropriate for the searcher.

[0012]

[Means for Solving the Problems] The first invention is a search result evaluation device that comprises a database consisting of a plurality of records and items, a search request in which a searcher enters search items and search terms, a search expression generation unit that generates a search expression from the search request, and a search execution unit that searches the database using the search expression, and that outputs the results of the search in order of relevance to the searcher, the device being characterized by comprising: a record item-specific occurrence frequency reference unit that extracts the words appearing in each record in the database, identifies the record and item of each extracted word, and refers to its occurrence frequency; a record-specific item length reference unit that refers to the item length, which is the length of each item of each of the plurality of records; a word-specific occurrence record count reference unit that refers, item by item, to the number of records in the database in which each extracted word appears; an index term weight calculation unit that calculates the weight of each extracted word from the occurrence frequency, the item length, and the number of records, and generates a weighted index term file specifying the record, item, and weight of each extracted word; a search term weight calculation unit that receives the weights of the search terms entered by the searcher from the weighted index term file and calculates the weights of the search results obtained by the search execution unit; and a record relevance calculation unit that refers to the weights of the search results and outputs relevance-ordered search results in which the records constituting the search results are sorted in order of relevance.

[0013] The second invention is characterized in that, in the first invention, the index term weight calculation unit holds a threshold for selecting the extracted words and, when calculating the weights, treats the extracted words whose weight is equal to or greater than the threshold as index terms and generates a weighted index term file specifying the record, item, and weight of each index term.

[0014] Further, the third invention is characterized in that, in the first or second invention, the index term weight calculation unit calculates the weight by multiplying the occurrence frequency by the reciprocal of the number of records and dividing the result by the item length.

[0015] Further, the fourth invention is characterized in that, in the first, second, or third invention, the record item-specific occurrence frequency reference unit uses natural language analysis when extracting words.

[0016]

[Embodiment] Next, an embodiment of the present invention will be described with reference to the drawings.

[0017] FIG. 1 is a block diagram showing the configuration of one embodiment of the present invention. FIG. 2 shows an example of the result obtained by the record item-specific occurrence frequency reference unit in FIG. 1. FIG. 3 shows an example of the result obtained by the record-specific item length reference unit in FIG. 1. FIG. 4 shows an example of the result obtained by the word-specific occurrence record count reference unit in FIG. 1. FIG. 5 is an explanatory diagram showing an example of the weighted index term file in FIG. 1.

[0018] In FIG. 1, the database 101 is the database to be searched. The record item-specific occurrence frequency reference unit 102 extracts, item by item, the words appearing in each record in the database 101 and refers to the occurrence frequency, that is, how many times each word appears in that item. The database of this embodiment is a database on library and information science with three items: "title", "abstract", and "body". The present invention is of course not limited to this embodiment and can be applied to a variety of databases.

[0019] In extracting words in this embodiment, items expressed as text, such as the title, abstract, and body, are subjected to natural language analysis and segmented into words using, for example, the natural language analysis methods described in Chapters 1 and 2 of "Basic Technology of Natural Language Processing" (自然言語処理の基礎技術, by Hirosato Nomura, published by the Institute of Electronics, Information and Communication Engineers, 1988). Of the segmented words, those remaining after removing inflectional endings, auxiliary verbs, adnominal particles, sentence-final particles, adverbial particles, case particles, and coordinating particles are extracted. This is only one example, however. Other conceivable techniques for extracting words include a method that keeps a separate table of keywords prepared in advance according to the type of database to be searched and extracts words by consulting that table, and a method that refers to the occurrence frequency in each item of the database only for the search terms in the searcher's search request 107.

[0020] The record-specific item length reference unit 103 refers to the length of each item of each record in the database 101.

[0021] The word-specific occurrence record count reference unit 104 refers, item by item, to the number of records, out of all records in the database 101, in which each word extracted by the record item-specific occurrence frequency reference unit 102 appears.
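The three reference units 102, 103, and 104 can be pictured as three lookup tables built in a single pass over the database. The following is a minimal sketch under stated assumptions: the toy records, the whitespace `tokenize` function (a stand-in for the natural language analysis of this embodiment), the measurement of item length in characters, and all variable names are hypothetical and do not come from the source.

```python
from collections import Counter, defaultdict

# Hypothetical toy database: record id -> item name -> text.
database = {
    "#1": {"title": "university library network",
           "abstract": "a network for the university library"},
    "#2": {"title": "library catalog",
           "abstract": "catalog of a university library"},
}


def tokenize(text):
    # Stand-in for natural language analysis: split on whitespace.
    return text.split()


occurrence_frequency = {}        # (record, item) -> Counter of words      (unit 102)
item_length = {}                 # (record, item) -> length in characters  (unit 103)
record_count = defaultdict(int)  # (item, word)   -> number of records     (unit 104)

for record_id, items in database.items():
    for item, text in items.items():
        counts = Counter(tokenize(text))
        occurrence_frequency[(record_id, item)] = counts
        item_length[(record_id, item)] = len(text)
        # Each record contributes at most 1 to the per-item record count.
        for word in counts:
            record_count[(item, word)] += 1
```

Keying the record count by (item, word) rather than by word alone reflects the point of this embodiment that a word appearing in many titles and a word appearing in many abstracts are counted separately.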

[0022] The index term weight calculation unit 105 calculates the weight of each word extracted by the record item-specific occurrence frequency reference unit 102 and treats words whose weight is equal to or greater than a certain threshold as index terms. The weight of a word is a value indicating its importance, that is, how much the word contributes to the content of the record; the larger the value, the higher the importance. The threshold may be chosen after the weights of all words have been calculated, as a value suitable for eliminating words of low importance, or it may be fixed in advance.

[0023] For each index term, this weight is determined by referring to the occurrence frequency in the record item-specific occurrence frequency reference unit 102, the number of records in the word-specific occurrence record count reference unit 104, and the item length in the record-specific item length reference unit 103, and applying the following formula:

weight = occurrence frequency × (reciprocal of the number of records) / item length ... (1)
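As a minimal sketch, formula (1) can be written as a small Python function. The function and parameter names are illustrative assumptions and do not appear in the source.

```python
def index_term_weight(occurrence_frequency: int,
                      record_count: int,
                      item_length: int) -> float:
    """Formula (1): occurrence frequency x (1 / record count) / item length."""
    if record_count <= 0 or item_length <= 0:
        raise ValueError("record_count and item_length must be positive")
    return occurrence_frequency * (1.0 / record_count) / item_length


# With the record count held equal, a word that occurs once in a
# 20-character item outweighs the same word occurring once in a
# 2,000-character item.
short_item = index_term_weight(1, 100, 20)
long_item = index_term_weight(1, 100, 2000)
```

The comparison at the end illustrates the length normalization that motivates the invention: the shorter item yields a weight 100 times larger.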

[0024] The weighted index term file 106 is a file that lists the index terms and the weights calculated by the index term weight calculation unit 105 for each item of each record in the database 101. The weighted index term file is described in detail later.

[0025] The search request 107 consists of the search items and search terms entered by the searcher.

[0026] The search expression generation unit 108 generates a search expression from the search request 107. If the search request 107 specifies a relationship between search terms, that relationship is used. If none is specified, all search terms are joined with OR. If the search expression generation unit 108 is equipped with a thesaurus, a synonym dictionary, or the like, the search expression is generated by joining synonyms, variant spellings, broader terms, narrower terms, and so on of the search terms entered by the searcher with OR.
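The rules for generating a search expression can be sketched as follows. This is a simplified illustration, not the method of the source: it assumes the searcher specifies at most one operator that applies between all terms, represents the expression as a plain string, and uses a hypothetical `synonyms` mapping in place of a thesaurus or synonym dictionary.

```python
def build_search_expression(terms, operator=None, synonyms=None):
    """Join search terms into a boolean expression string.

    operator: relationship given by the searcher (e.g. "AND"); OR if omitted.
    synonyms: hypothetical mapping term -> list of related terms to OR in.
    """
    synonyms = synonyms or {}
    groups = []
    for term in terms:
        variants = [term] + list(synonyms.get(term, []))
        if len(variants) > 1:
            groups.append("(" + " OR ".join(variants) + ")")
        else:
            groups.append(term)
    joiner = f" {operator} " if operator else " OR "
    return joiner.join(groups)


terms = ["university library", "network"]
default_expr = build_search_expression(terms)
and_expr = build_search_expression(terms, operator="AND")
expanded_expr = build_search_expression(
    terms, operator="AND", synonyms={"network": ["networking"]}
)
```

With no relationship given, the terms are joined with OR; with "AND" given, that relationship is kept, and any related terms are grouped with OR alongside the term they expand.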

[0027] The search execution unit 109 executes the search expression generated by the search expression generation unit 108 and retrieves the records that constitute the search result from the database 101.

[0028] The search term weight calculation unit 110 refers to the weighted index term file 106 to obtain the weight of each search term of the search request 107 in each record of the search result obtained by the search execution unit 109.

[0029] The record relevance calculation unit 111 adds up the weights calculated by the search term weight calculation unit 110 to obtain, for each record, a relevance value indicating how well the record matches the searcher's request, sorts the records in order of relevance, and outputs the relevance-ordered search results 112.

[0030] An example is now described. As mentioned above, the search target is a document database on library and information science that holds 10,000 records, each with title, abstract, and body items.

[0031] First, the index term weight calculation unit 105 obtains, from the record item-specific occurrence frequency reference unit 102, the words extracted item by item from each record of the database 101 together with the occurrence frequency of each word in each item of each record.

[0032] The record item-specific occurrence frequency reference unit 102 of this embodiment applies natural language analysis to the title, abstract, and body items of each record and extracts the words that remain after removing inflectional endings, auxiliary verbs, adnominal particles, sentence-final particles, adverbial particles, case particles, and coordinating particles. It then counts the occurrence frequency, that is, how many times each word appears in each item. FIG. 2 shows an example of the result referred to by the record item-specific occurrence frequency reference unit 102. For example, the word "university library" appears once in the "title" item of record #1, three times in the "abstract" item of the same record, and 149 times in its "body" item.

[0033] Next, the index term weight calculation unit 105 uses the record-specific item length reference unit 103 to look up the length of each item of each record in the database 101. FIG. 3 shows an example of the result referred to by the record-specific item length reference unit 103. According to this figure, in record #1 the "title" item has a length of 76, the "abstract" item has a length of 264, and the body has a length of 24398.

[0034] The index term weight calculation unit 105 further uses the word-specific occurrence record count reference unit 104 to obtain the number of records in the database 101 in which each word extracted by the record item-specific occurrence frequency reference unit 102 appears. FIG. 4 shows an example of the result referred to by the word-specific occurrence record count reference unit 104. It shows that, of the 10,000 records in the database 101, 63 contain the word "university library" in the "title" item and 97 contain it in the "abstract" item.

[0035] From the above data, the index term weight calculation unit 105 calculates the weight of each word using formula (1) above. For example, the weight of "university library" in the "title" item of record #1 is (1/76) × (10000/63) = 2.089, and the weight of "university library" in the "abstract" item of record #1 is (3/264) × (10000/97) = 1.172.
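The two weights in this worked example can be checked with a few lines of Python. Note that the worked figures multiply by 10000/63 and 10000/97, so the sketch below scales the reciprocal of the record count by the total number of records in the example database (10,000). The function and variable names are assumptions made for illustration only.

```python
TOTAL_RECORDS = 10000  # size of the example database


def example_weight(occurrence_frequency, item_length, record_count,
                   total_records=TOTAL_RECORDS):
    # (occurrence frequency / item length) x (total records / record count),
    # matching the arithmetic shown in the worked example.
    return (occurrence_frequency / item_length) * (total_records / record_count)


# "university library" in record #1:
title_weight = example_weight(1, 76, 63)      # once in a title of length 76
abstract_weight = example_weight(3, 264, 97)  # three times in an abstract of length 264

print(round(title_weight, 3), round(abstract_weight, 3))  # prints 2.089 1.172
```

Although the word occurs three times in the abstract and only once in the title, the shorter title and the smaller title record count give the title occurrence the higher weight.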

[0036] Finally, the index term weight calculation unit 105 writes to the weighted index term file 106, as index terms and together with their weights, those words appearing in the database 101 whose weight is equal to or greater than the threshold. FIG. 5 shows an example of the weighted index term file 106. In this example the threshold is 0.05, and words with a weight of 0.05 or less are not listed. The threshold may be freely configurable or may be fixed in advance.
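The selection of index terms by threshold can be sketched as a simple filter. The dictionary layout, the function name, and the low-weight "catalog" entry are hypothetical; the two "university library" weights are the ones computed in the example. Because this example omits words with a weight of 0.05 or less, the sketch keeps only weights strictly greater than the threshold, which is one possible reading of how a weight exactly equal to the threshold is handled.

```python
THRESHOLD = 0.05  # threshold used in the example


def build_weighted_index(weights, threshold=THRESHOLD):
    """Keep only the entries whose weight exceeds the threshold.

    weights maps (record_id, item, word) -> weight.
    """
    return {key: w for key, w in weights.items() if w > threshold}


candidate_weights = {
    ("#1", "title", "university library"): 2.089,
    ("#1", "abstract", "university library"): 1.172,
    ("#1", "body", "catalog"): 0.012,  # hypothetical low-weight word
}

weighted_index = build_weighted_index(candidate_weights)
```

Keying the index by record, item, and word lets a later lookup retrieve a weight directly from a record number, a search item, and a search term.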

[0037] Suppose the searcher wants documents on "networks in university libraries". In this example, "university library" and "network" are entered as the search terms of the search request 107 and "title" and "abstract" as the search items, and the searcher searches for records in which both "university library" and "network" appear in the title or abstract item.

[0038] The search expression generation unit 108 converts the search request 107 into a search expression. If the searcher has made the relationship between the search terms explicit, that is, if the input is of the form "university library AND network", that relationship is used as is. If it is not made explicit, all search terms are joined with OR. If a thesaurus, a synonym dictionary, or the like is available, synonyms, variant spellings, near-synonyms, and so on of the search terms may be added with OR.

[0039] The search execution unit 109 executes the search expression generated by the search expression generation unit 108 and receives the search result from the database 101. In this example, of the 10,000 records, those that satisfy the above search expression are assumed to be records #1 to #3.
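The search execution step for this example can be sketched as a filter over records. The toy records and the function name are hypothetical. The sketch also assumes that a record matches when each search term occurs in at least one of the search items; the source does not say whether both terms must occur in the same item.

```python
def matches_all(record, terms, items):
    """True if every term occurs in at least one of the given items."""
    return all(
        any(term in record.get(item, "") for item in items)
        for term in terms
    )


# Hypothetical toy records: record id -> item name -> text.
records = {
    "#1": {"title": "university library network",
           "abstract": "a network linking university library catalogs"},
    "#2": {"title": "public library services",
           "abstract": "the university library and its network"},
    "#3": {"title": "network protocols",
           "abstract": "routing in a campus network"},
}

search_terms = ["university library", "network"]
search_items = ["title", "abstract"]

hits = [record_id for record_id, record in records.items()
        if matches_all(record, search_terms, search_items)]
```

At this stage the matching records are returned unordered; ranking is done afterwards from the weights in the weighted index term file.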

[0040] The search term weight calculation unit 110 refers to the index term file 106 shown in FIG. 5 to determine how much weight the search terms "university library" and "network" have in the search items "title" and "abstract" of each record in the search result.

[0041] First, the search term weight calculation unit 110 looks up the index term file 106 of FIG. 5 by the record number of the search result, the search item, and the search term, and refers to the corresponding weight. Suppose the search result contains the record with record number #1. The weight of "university library" in the "title" item of #1 is 2.089, and its weight in the "abstract" item of #1 is 1.172. The weight of "network" in the "title" item of #1 is 2.651, and its weight in the "abstract" item is 3.157.

[0042] The record relevance calculation unit 111 calculates the relevance of each record by adding up the weights in that record of the search terms "university library" and "network" output by the search term weight calculation unit 110. For record #1, for example, adding the weights of "university library" and "network" in the "title" and "abstract" items gives 2.089 + 1.172 = 3.261 for "university library" and 2.651 + 3.157 = 5.808 for "network". Their total is 3.261 + 5.808 = 9.069, and this value of 9.069 is the relevance of record #1.

[0043] Similarly, in record #2 the weight of "university library" is 0.968 + 0.543 = 1.511 and the weight of "network" is 0.487, so their total, 1.511 + 0.487 = 1.998, is the relevance of #2. Likewise, in record #3 the weight of "university library" is 0.469 and the weight of "network" is 7.701, so the relevance of #3 is 8.170.

[0044] Based on these results, the record relevance calculation unit 111 sorts the results in order of record relevance. In this example, the records are output in the order #1, #3, #2.

[0045]

[Effects of the Invention] As described above, because the present invention takes the length of each record's items into account when calculating weights, a word that appears once in a record of a few dozen characters can be given a higher weight than a word that appears once in a record of several thousand characters.

[0046] In addition, because the invention considers how widely a word appears across the records in the database, a word that appears only in particular records can be given a higher weight than a word that appears repeatedly in every record.

[0047] Furthermore, by setting an appropriate threshold, unnecessary words can be eliminated and index terms that are effective for searching can be obtained.

[0048] As a result, according to the present invention, search term weights that take into account the items of each record and their lengths can be calculated from the search terms entered by the searcher, and search results arranged in a more accurate order of relevance can be obtained.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of one embodiment of the present invention.

FIG. 2 is a diagram showing an example of the result referred to by the record item-specific occurrence frequency reference unit.

FIG. 3 is a diagram showing an example of the result referred to by the record-specific item length reference unit.

FIG. 4 is a diagram showing an example of the result referred to by the word-specific occurrence record count reference unit.

FIG. 5 is a diagram showing an example of the weighted index term file.

[Explanation of symbols]

101 database
102 record item-specific occurrence frequency reference unit
103 record-specific item length reference unit
104 word-specific occurrence record count reference unit
105 index term weight calculation unit
106 weighted index term file
107 search request
108 search expression generation unit
109 search execution unit
110 search term weight calculation unit
111 record relevance calculation unit
112 relevance-ordered search results

Claims

(57) [Claims]

1. A search result evaluation device that comprises a database consisting of a plurality of records and items, a search request in which a searcher enters search items and search terms, a search expression generation unit that generates a search expression from the search request, and a search execution unit that searches the database using the search expression, and that outputs the results of the search in order of relevance to the searcher, the device comprising: a record item-specific occurrence frequency reference unit that extracts the words appearing in each record in the database, identifies the record and item of each extracted word, and refers to its occurrence frequency; a record-specific item length reference unit that refers to the item length, which is the length of each item of each of the plurality of records; a word-specific occurrence record count reference unit that refers, item by item, to the number of records in the database in which each extracted word appears; an index term weight calculation unit that calculates the weight of each extracted word from the occurrence frequency, the item length, and the number of records, and generates a weighted index term file specifying the record, item, and weight of each extracted word; a search term weight calculation unit that receives the weights of the search terms entered by the searcher from the weighted index term file and calculates the weights of the search results obtained by the search execution unit; and a record relevance calculation unit that refers to the weights of the search results and outputs relevance-ordered search results in which the records constituting the search results are sorted in order of relevance.

2. The search result evaluation device according to claim 1, wherein the index term weight calculation unit holds a threshold for selecting the extracted words and, when calculating the weights, treats the extracted words whose weight is equal to or greater than the threshold as index terms and generates a weighted index term file specifying the record, item, and weight of each index term.

3. The search result evaluation device, wherein the index term weight calculation unit calculates the weight by multiplying the occurrence frequency by the reciprocal of the number of records and dividing the result by the item length.

4. The search result evaluation device according to claim 1, wherein the record item-specific occurrence frequency reference unit uses natural language analysis when extracting words.