JP2007213601A - Method for searching a plurality of databases and method for searching for literature between the plurality of databases

Info

Publication number: JP2007213601A
Application number: JP2007100194A
Authority: JP
Inventors: T Kirsch Stephen; スティーブン・ティー・キルシュ
Original assignee: Rakuten Inc
Current assignee: Rakuten Group Inc
Priority date: 2007-04-06
Filing date: 2007-04-06
Publication date: 2007-08-23

Abstract

PROBLEM TO BE SOLVED: To provide a method for searching documents across a plurality of databases served by at least one server, using at least one search engine.

SOLUTION: For each database, the number of records is determined and reported, and the number of hits, i.e., occurrences of each searched query term, is determined and reported together with the identification of the database records corresponding to the hits. The reports from the plurality of databases are delivered to a user terminal, i.e., a client, and client software calculates a relevance score for each record on the basis of the number of records in the databases, the number of records having at least one hit, and the number of hits in each record. Because the scores are calculated locally from the combined data, all documents receive a consistent ranking, as if the results had come from a single database.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to the search and retrieval of documents, and in particular to search and retrieval over a network.

For over 20 years, information services have provided access to multiple databases. For example, Dialog Information Services, now known as Knight-Ridder Information, Inc., offers searchers hundreds of databases (also known as collections). Some of these databases contain bibliographic abstracts, while others contain full-text documents. A searcher can apply a query to one or more databases. First, the searcher either selects individual databases of interest based on past experience, or selects a group of databases that the information provider has assembled around a particular topic. For example, a searcher might select the topic of patents, for which the information service has grouped several databases limited to patents. When a query is applied to a group of databases, the information service returns the number of hits in each database. The searcher then accesses the databases of interest and browses individual records. This system was originally designed for librarians and professional researchers who know where to look for the information they want.

As wide area networks such as the Internet have become available, new search opportunities have opened up not only to professional searchers but also to lay users. New types of information providers have emerged that use public as well as private databases to supply users with bibliographic research data and documents. A user interested in a topic, such as patents, may not know what resources can be brought to bear on a search, or where those resources are located. Because resources change frequently, a searcher is likely to care less about the source of a response than about its relevance. It has been recognized that distributed collections available over a wide area network can be treated as a single collection: each sub-collection is searched separately and the reports are combined into a single list. It is also known that documents can be ranked and weighted by a search engine according to an algorithm that takes the properties of the particular collection into account. Document scores can be normalized to obtain the scores that would have resulted had the individual collections been merged into a single unified collection.

One problem with the prior art is that the score for each document is not absolute; it depends on the statistics of each collection and on the algorithm of the associated search engine. A second problem is that the standard prior art procedure requires two passes. In the first pass, statistics are gathered from each search engine in order to calculate a weight for each query term. In the second pass, the information from the first pass is sent back to each search engine, which then assigns a particular weight or score to each hit, i.e., each identified document. A third problem is that the prior art requires all collections to use the same search engine.

An object of the present invention is to devise a method for searching multiple collections in a single pass, with documents ranked on a consistent basis, so that a document appearing in two different databases is scored the same way when the results are merged. The same search engine need not be used for all collections.

Summary of the Invention

The above object is achieved by a document search and retrieval method that requires each participating search engine server to return statistics for each query term in each of the returned documents. The final relevance score is then calculated at the client terminal rather than at the server. In this way, all relevance scores are computed at the client in the same manner, regardless of differences among the search engines.
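The single-pass flow summarized above can be sketched in Python. This is only an illustration under stated assumptions, not the implementation described in this publication: `engine_a` and `engine_b` are hypothetical stand-ins for remote search servers, the report layout (`N`, `N_TERM`, `hits`) is assumed for the sketch, and all numbers are invented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for two remote search servers. A real server would
# run its own search engine; here each simply returns a canned report.
def engine_a(query):
    return {"N": 1000, "N_TERM": {"interface": 50},
            "hits": {"a-1": {"interface": 3}}}

def engine_b(query):
    return {"N": 200, "N_TERM": {"interface": 40},
            "hits": {"b-1": {"interface": 1}}}

def gather_reports(query, engines):
    """Send the same query to every engine at once and collect the raw reports.

    No scores are requested from the servers; scoring is left to the client.
    """
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # pool.map preserves the order of `engines` in its results.
        return list(pool.map(lambda engine: engine(query), engines))

reports = gather_reports("interface", [engine_a, engine_b])
```

Each server is contacted once, and nothing is sent back to it after the reports arrive, which is what distinguishes this flow from the two-pass procedure described in the background.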

Best mode for carrying out the invention

Referring to FIG. 1, a query, indicated by query block 11, is formulated by a user and entered at a terminal or client system. The query is transmitted electronically to a network interface 13. The network interface is an information service with access to sources 17 that have databases relevant to the subject of the query. These databases, which reside on other servers, are polled simultaneously over a communication channel 15. The communication channel 15 may be a wide area network link to the sources 17. The Internet is one model for such a wide area network link and remote sources. The query is applied to the search engines, represented by columns 20, 30, and 40, each of which accesses its associated database at block 19. Each search engine may have its own operating characteristics, such as Boolean logic or statistical inference. First, each database generates a report containing the number N of records in the database, indicated by block 21. Second, the report includes the number of times each search term appears in the documents responsive to the query; this quantity, N_TERM, is indicated by block 23. Third, the report gives, as indicated by block 25, the document identification number of each document containing a hit, together with the number of times each search term appears. From this information, as indicated by block 27, the client software calculates a score for each document using its own algorithm. For example, the formula for calculating the score is:

where t_f is the number of times the term appears in the document and idf = log(N / N_TERM), N and N_TERM here being the sums of the N and N_TERM values reported by all of the collections. The calculated scores are sent to an output buffer, indicated by block 29, which takes from calculation block 27 the top M scores requested by the person making the query. Note that only a single pass is made through the databases. The calculated scores are treated as absolute values.
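The client-side calculation can be illustrated with a short Python sketch. The scoring formula itself is not reproduced in this text, so the sketch assumes a conventional form consistent with the definitions given here: the score of a document is the sum, over the query terms, of t_f multiplied by idf. It further assumes, following the claims, that N_TERM counts the records containing a term. The class `CollectionReport` and the function `score_documents` are hypothetical names introduced for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CollectionReport:
    """One server's report for a query (assumed layout, for illustration)."""
    n_records: int                   # N: number of records in the database
    n_term: Dict[str, int]           # N_TERM for each query term
    hits: Dict[str, Dict[str, int]]  # document id -> {term: t_f}

def score_documents(reports: List[CollectionReport],
                    query_terms: List[str]) -> Dict[str, float]:
    """Score every reported document at the client.

    Assumed formula: score = sum over query terms of t_f * log(N / N_TERM),
    with N and N_TERM summed over all collections.
    """
    total_n = sum(r.n_records for r in reports)
    idf = {}
    for term in query_terms:
        total_n_term = sum(r.n_term.get(term, 0) for r in reports)
        # A term that no collection reports contributes nothing.
        idf[term] = math.log(total_n / total_n_term) if total_n_term else 0.0

    scores = {}
    for report in reports:
        for doc_id, tf in report.hits.items():
            scores[doc_id] = sum(tf.get(t, 0) * idf[t] for t in query_terms)
    return scores

# Invented numbers: two collections of different size, one query term.
example = [
    CollectionReport(100, {"gui": 10}, {"a1": {"gui": 2}}),
    CollectionReport(900, {"gui": 90}, {"b1": {"gui": 2}}),
]
example_scores = score_documents(example, ["gui"])
```

Because idf is computed once from the pooled totals, two documents with the same term counts receive the same score regardless of which collection reported them.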

In an alternative embodiment, optional parameters may be reported for use in the algorithm. Block 26 indicates that the frequency of the most frequent term in each document is reported, so that the client can use a more sophisticated ranking formula. As a further optional data reduction step, each search engine may calculate a relevance score for the documents in a manner known in the prior art. From this data, the search engine can preselect up to the top M hits in its database, where M is the maximum number of hits requested by the user.
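The two options in this paragraph can be sketched in Python as follows. The text does not say how the client uses the frequency of the most frequent term, so dividing t_f by that frequency is an assumption chosen for illustration (a common length normalization), and the function names `normalized_tf`, `preselect`, and `top_m` are hypothetical.

```python
import heapq

def normalized_tf(tf, max_tf):
    """Assumed use of the block 26 value: scale t_f by the document's
    most frequent term so that long documents are not favored."""
    return tf / max_tf

def preselect(server_scores, m):
    """Server side: keep the ids of up to m best hits, ranked by the
    server's own score, to reduce the data sent to the client."""
    return heapq.nlargest(m, server_scores, key=server_scores.get)

def top_m(client_scores, m):
    """Client side: after rescoring, keep the m highest (id, score) pairs."""
    return heapq.nlargest(m, client_scores.items(), key=lambda kv: kv[1])

# Invented scores for illustration.
kept = preselect({"a": 0.2, "b": 0.9, "c": 0.5}, 2)
best = top_m({"x": 1.0, "y": 3.0, "z": 2.0}, 2)
```

Preselection uses each server's own score only to decide which hits to report; under the scheme described here, the ranking shown to the user would still come from the client's calculation.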

As an example, a search query might concern documents containing the phrase "graphical user interface". Table 1 below shows the report generated by a search engine that has selected several of the highest-ranked documents. This report is returned to the user's client software, where an algorithm such as formula (1) above is applied using the term frequency data and document frequency data returned by each search engine. A document weight is thus calculated locally for each query term, with the N_TERM and N (the number of documents) values returned from each collection combined. The term weights are therefore exactly the same as if the collections had been a single collection. Scoring is fully consistent even when different search engines take part in the search, and the same document appearing in two different collections always receives the same score.
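The consistency property can be checked with a small Python example. The numbers are invented for illustration and are not taken from Table 1; the sketch assumes a score of t_f multiplied by log(N / N_TERM) for a single query term, and the function names are hypothetical.

```python
import math

# Invented reports from two servers for the single query term "interface".
# The same document, "doc-42", is held by both collections.
reports = [
    {"N": 1000, "N_TERM": 50, "hits": {"doc-42": 3}},
    {"N": 200, "N_TERM": 40, "hits": {"doc-42": 3, "doc-7": 1}},
]

def local_scores(report):
    """Score with one collection's own statistics only (the inconsistent case)."""
    idf = math.log(report["N"] / report["N_TERM"])
    return {doc: tf * idf for doc, tf in report["hits"].items()}

def pooled_scores(all_reports):
    """Score every hit with N and N_TERM summed over all collections."""
    total_n = sum(r["N"] for r in all_reports)
    total_n_term = sum(r["N_TERM"] for r in all_reports)
    idf = math.log(total_n / total_n_term)
    return [{doc: tf * idf for doc, tf in r["hits"].items()}
            for r in all_reports]

local = [local_scores(r) for r in reports]
pooled = pooled_scores(reports)
```

With per-collection statistics, "doc-42" gets a different score from each server because the two idf values differ; with pooled statistics, both copies are scored with the same idf and therefore tie.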

FIG. 1 is a block diagram of the system of the present invention.

Explanation of symbols

11 Query
13 Network interface
15 Communication channel
17 Source
19 Database
20 Search engine
30 Search engine
40 Search engine

Claims

1. A method for searching a plurality of databases that are distributed and accessible to a client via one or more search servers, comprising:
(A) providing a search query from the client to each database and its associated server;
(B) obtaining, at the client, statistics for each database from each server;
(C) obtaining, at the client, from each server, information about the documents resulting from applying the query to the database; and
(D) calculating, at the client, a score for each document using the statistics and the information, the calculated scores being applicable across all of the databases as though the databases were combined into a single database.

2. The method of claim 1, wherein the statistics for each database include the size of the database in terms of the number of records.

3. The method of claim 1, wherein the information about each document includes the number of times each search term appears in the document.

4. The method of claim 1, wherein the information about each database includes the number of documents that contain each search term.

5. A method for searching a plurality of databases that are distributed and accessible to a client via one or more servers, comprising:
(A) accessing each database from the client;
(B) providing a search query from the client to the server associated with each database;
(C) obtaining, at the client, statistics for each database;
(D) obtaining, at the client, statistical information about the relevant documents resulting from applying the query to the databases; and
(E) calculating, at the client, a score for each relevant document using the statistics and the information, wherein the score calculated for a document is independent of the database in which the document appears.

6. A method of searching for documents among a plurality of databases in response to a search query, comprising:
(A) providing the search query to each database;
(B) determining the number of records in each database;
(C) applying the search query to each of the databases and recording the number of hits for each query term, together with the identification of each database record having at least one of the hits;
(D) counting, for each of the databases and for each query term, the records having at least one hit; and
(E) reporting to the user a relevance score of each record for the search query, calculated using the results of steps (B), (C) and (D).

7. The method of claim 6, further defined by selecting several databases, the databases having more than one search engine, prior to providing the same search query to all of the databases.

8. The method of claim 7, further defined by selecting, from among the several databases, a number of records to be reviewed, the selected records being those having the highest relevance scores for the search query.

9. The method of claim 8, further defined by preselecting a number of records prior to calculating the relevance scores.