JPH06337897A

JPH06337897A - Device and method for retrieving document

Info

Publication number: JPH06337897A
Application number: JP5151303A
Authority: JP
Inventors: Hiroyuki Sakakura; 弘行坂倉; Motoyoshi Sawatani; 元喜澤谷; Katsunobu Shibata; 克信柴田
Original assignee: Nippon Steel Corp
Current assignee: Nippon Steel Corp
Priority date: 1993-05-27
Filing date: 1993-05-27
Publication date: 1994-12-06

Abstract

PURPOSE: To easily and quickly determine to what degree each of a plurality of documents contains a character string matching a specific character string, by performing on all documents a fuzzy search that includes incomplete matches of the specific character string and then performing a conditional search on the extracted documents using previously set search fields. CONSTITUTION: To search all documents stored in a storage device 2 for a specific character string from a client machine 4, the character string is input from the machine 4 as a search key and the threshold for the degree of match is set to a prescribed value or more. Various documents are stored in the device 2; each document consists of a title and a body, and 'electricity', 'chemistry', 'machinery', etc. are set as its fields. A map of autocorrelation information is prepared in advance, so when the server 1 searches all the documents, autocorrelation information is also generated for the specific character string and collated with the previously prepared map. Consequently, high-speed retrieval can be performed.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a document retrieval device and a document retrieval method, and more particularly to a document retrieval device and a document retrieval method for retrieving a specific document from a plurality of documents managed by a server on a network accessed by a plurality of clients.

[0002]

[Prior Art] Conventionally, in a server-client workstation system provided with, for example, a plurality of clients and a server, when many documents are stored on the server and searched from a client machine over the network, the only search possible is whether each document contains a specific character string (keyword). Normally this is an exact-match search that checks only whether the entire character string is contained. Such a search can fail because of subtle differences in notation, particularly for long loanwords written in katakana. In addition, when the total volume of documents to be searched is large, for example because there are many target documents, the search becomes markedly slow.

[0003] Japanese Unexamined Patent Application Publication No. H4-326164, filed by the same applicant as the present application, discloses a retrieval system in which, when a document is stored, the autocorrelation information of each character (code) is stored for each document at the same time, and at search time the autocorrelation information of each character of the keyword is obtained and its presence or absence is detected. With this structure, not only the presence or absence of the keyword in each target document but also its degree of match can be checked easily and at high speed.

[0004] However, the only search generally available for documents is keyword search. When diverse documents are stored, a keyword alone is not always sufficient, and a search by, for example, the field, author, or format of the document can also be useful.

[0005] It is therefore conceivable to build a database in which these documents are one of the fields. However, keyword search in an ordinary database cannot achieve the high-speed retrieval described in the above publication, and a fuzzy search in particular takes a great deal of time.

[0006]

[Problem to Be Solved by the Invention] The present invention has been made in view of the above problems of the prior art. Its main object is to provide a document retrieval device and a document retrieval method that can easily and quickly determine to what degree each document to be searched contains a character string matching a specific character string, and that also support conditional search.

[0007]

[Means for Solving the Problem] According to the present invention, the above object is achieved by providing a document retrieval device for retrieving a specific document from a plurality of documents stored in a storage device, characterized by having one or more search fields set for each document, means for performing on all the documents a fuzzy search that includes incomplete matches of a specific character string, and means for performing a conditional search by the search fields on the documents extracted by the fuzzy search; and by providing a document retrieval method for retrieving a specific document from a plurality of documents stored in a storage device, characterized by having a step of performing on all the documents a fuzzy search that includes incomplete matches of a specific character string, and a step of performing, on the documents extracted by the fuzzy search, a conditional search by one or more search fields set in advance for each document.

[0008]

[Operation] In this way, one or more search fields are set in an area of each document (for example, each document being a separate file) that is not used for normal document processing, a fuzzy search that includes incomplete matches of a specific character string is made possible for all documents, and the search result can be further narrowed down by conditions set on the search fields. As a result, diverse searches including fuzzy search can be performed at high speed on a plurality of documents stored in the storage device without adopting a database structure.

[0009]

[Embodiment] A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

[0010] FIG. 1 is a block diagram showing the system configuration of a server-client workstation system to which the present invention is applied. This system has a server 1 with a mass storage device 2, and a plurality of client machines 4 connected to the server 1 via a known network 3.

[0011] Various documents are stored in the storage device 2. FIG. 2 shows the structure of each document and the fields set for each document. Each document itself consists of a "title" and a "body". The fields "electricity", "chemistry", "machinery", "information", and "other" are set, and among these a flag is set in the field corresponding to the subject area of the document. Needless to say, flags may be set in more than one field.

[0012] For example, document 1 is a patent publication on an electric circuit and has its flag set in the field "electricity", while document 2 is a newspaper article and has its flag set in the field "chemistry". In this way, each of documents 1 to n (the numbers 1 to n being ID numbers managed by the server) has a flag set in at least one of the fields.
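The document structure of paragraphs [0011] and [0012] can be pictured with a minimal Python sketch. The field names come from the embodiment; the `Document` class, its attribute names, and the use of a set for the flags are illustrative assumptions, not the storage layout of the invention.

```python
from dataclasses import dataclass, field

# Field names from the embodiment; everything else here is hypothetical.
FIELDS = ("electricity", "chemistry", "machinery", "information", "other")


@dataclass
class Document:
    doc_id: int          # ID number managed by the server
    title: str
    body: str
    flags: set = field(default_factory=set)  # fields whose flag is set

    def __post_init__(self):
        # Reject flags that are not among the predefined search fields.
        unknown = self.flags - set(FIELDS)
        if unknown:
            raise ValueError(f"unknown field(s): {sorted(unknown)}")


# Document 1: patent publication on an electric circuit.
doc1 = Document(1, "Electric circuit", "Patent publication text", {"electricity"})
# Document 2: newspaper article flagged as chemistry.
doc2 = Document(2, "Newspaper article", "Article text", {"chemistry"})
```

Because the flags are kept as a set, a document can carry flags in several fields at once, as the embodiment allows.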

[0013] As shown in FIG. 3, the server 1 is provided with a search unit 11, which searches for a specific character string using the autocorrelation information described above and determines its degree of match, and an aggregation processing unit 12, which takes the search results from the search unit 11, tallies for each document the degrees of match for its "title" and "body", and sets the highest of these as the degree of match of that document.

[0014] The document search procedure in this embodiment is described below.

[0015] When a client machine 4 searches all documents stored in the storage device 2 for a specific character string, for example the string "フィードフォワード" (feedforward), the string is input from the client machine 4 as the "search key", and the threshold for the degree of match, described later, is set to, for example, 70% or more. The client machine 4 then accesses the server 1, and the server 1 actually performs the search on all documents stored in the storage device 2. Because the autocorrelation information of each document has been created and stored in advance as a map, as described above, a high-speed search is possible simply by creating autocorrelation information for the string "フィードフォワード" as well and collating it with the map. The speed of this search depends hardly at all on the total volume of the documents and depends instead on the length of the search string. The search range can also be limited to "title" or "body" on a screen (not shown) of the client machine 4. The actual search, however, is performed on the "title" and "body" of all documents regardless of this limitation.
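The text does not spell out how the autocorrelation information is computed (that is left to the earlier publication), so the following Python sketch swaps in a different, simpler technique purely for illustration: an index of character bigrams built at storage time, with the degree of match defined as the percentage of the search key's bigrams found in a given part of a document. The function names, the index layout, and this definition of the degree of match are all assumptions. What the sketch shares with the description is the shape of the procedure: a map prepared in advance, a search that does one lookup per element of the key, and a graded score that tolerates small notation differences.

```python
from collections import defaultdict


def bigrams(text):
    """Return the set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def build_map(docs):
    """At storage time, map each bigram to the (doc_id, part) pairs containing it."""
    index = defaultdict(set)
    for doc_id, parts in docs.items():
        for part, text in parts.items():
            for bg in bigrams(text):
                index[bg].add((doc_id, part))
    return index


def match_degrees(index, key):
    """Return {(doc_id, part): percentage of the key's bigrams found there}."""
    key_bigrams = bigrams(key)
    if not key_bigrams:
        return {}
    hits = defaultdict(int)
    for bg in key_bigrams:            # one index lookup per bigram of the key
        for loc in index.get(bg, ()):
            hits[loc] += 1
    return {loc: 100.0 * n / len(key_bigrams) for loc, n in hits.items()}


# Hypothetical sample documents; document 2 spells the loanword slightly differently.
docs = {
    1: {"title": "フィードフォワード制御", "body": "フィードフォワード補償器の設計"},
    2: {"title": "フィードフォワド制御", "body": "表記がわずかに異なる例"},
    3: {"title": "化学反応", "body": "触媒に関する記事"},
}
index = build_map(docs)
degrees = match_degrees(index, "フィードフォワード")
```

In this sketch the title of document 2 still scores above a 70% threshold even though one character of the katakana spelling is missing, which an exact-match search would not find.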

[0016] The search results from the search unit 11 are sent as they are to the aggregation processing unit 12. From these results, the aggregation processing unit 12 tallies for each document the degrees of match for its "title" and "body" and sets the highest of these as the degree of match of that document. If a search range has been set, the highest degree of match within that range is used as the degree of match of the document.
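The aggregation step can be sketched in Python as follows. The input format (a dictionary keyed by document ID and part name) and the function name `aggregate` are assumptions made for the example; the rule itself, taking the highest degree of match per document within the selected range, follows the paragraph above.

```python
def aggregate(part_degrees, search_range=None):
    """Collapse per-part degrees of match into one degree per document.

    part_degrees: {(doc_id, part): degree}
    search_range: iterable of part names to consider, or None for all parts.
    """
    allowed = set(search_range) if search_range is not None else None
    result = {}
    for (doc_id, part), degree in part_degrees.items():
        if allowed is not None and part not in allowed:
            continue  # part is outside the range chosen on the client
        if degree > result.get(doc_id, float("-inf")):
            result[doc_id] = degree
    return result


# Hypothetical search results covering both parts of every document.
part_degrees = {(1, "title"): 60.0, (1, "body"): 90.0, (2, "title"): 85.0}

overall = aggregate(part_degrees)
title_only = aggregate(part_degrees, ["title"])
```

Applying the range at aggregation time fits the statement that the search itself always covers both "title" and "body": the restriction only changes which results are tallied.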

[0017] The result is then sent to the client machine 4, and, as shown in FIG. 4, the documents whose degree of match is at or above the set threshold are displayed together on the display. When the operator clicks, for example, the "Sort" key in FIG. 4 with a pointing device such as a mouse, the documents are rearranged and displayed in descending order of degree of match.
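A minimal Python sketch of the threshold selection and the "Sort" operation is given below. The function names and the tie-breaking by document ID are assumptions; the 70% default mirrors the example value in the embodiment.

```python
def select_hits(doc_degrees, threshold=70.0):
    """Keep only documents whose degree of match is at or above the threshold."""
    return {doc_id: d for doc_id, d in doc_degrees.items() if d >= threshold}


def sort_by_degree(hits):
    """Order hits by descending degree of match; ties fall back to document ID."""
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical per-document degrees of match.
doc_degrees = {1: 90.0, 2: 65.0, 3: 100.0, 4: 70.0}
hits = select_hits(doc_degrees)
ranked = sort_by_degree(hits)
```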

[0018] Next, on the screen of FIG. 5, one or more of "electricity", "chemistry", "machinery", "information", and "other" are selected as the field restriction, combined by an OR condition or an AND condition. This condition further narrows down the documents displayed on the screen of FIG. 4.
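The narrowing by field can be sketched in Python as follows. The function name, the `mode` parameter, and the representation of each document's flags as a set are assumptions for illustration; the OR and AND semantics follow the paragraph above.

```python
def filter_by_fields(doc_flags, hit_ids, selected, mode="OR"):
    """Narrow fuzzy-search hits using the flags set in each document's fields.

    doc_flags: {doc_id: set of flagged field names}
    hit_ids:   document IDs returned by the fuzzy search, in display order
    selected:  field names chosen on the client screen
    mode:      "OR" keeps documents flagged in any selected field,
               "AND" keeps documents flagged in all selected fields
    """
    selected = set(selected)
    if mode not in ("OR", "AND"):
        raise ValueError("mode must be 'OR' or 'AND'")
    kept = []
    for doc_id in hit_ids:
        flags = doc_flags[doc_id]
        if mode == "OR":
            matches = bool(flags & selected)
        else:
            matches = selected <= flags
        if matches:
            kept.append(doc_id)
    return kept


# Hypothetical flags for three documents that passed the fuzzy search.
doc_flags = {
    1: {"electricity"},
    2: {"chemistry"},
    3: {"electricity", "information"},
}
either = filter_by_fields(doc_flags, [1, 2, 3], {"electricity", "information"}, "OR")
both = filter_by_fields(doc_flags, [1, 2, 3], {"electricity", "information"}, "AND")
```

The filter runs only over the documents already extracted by the fuzzy search, so its cost grows with the number of hits rather than with the size of the whole collection.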

[0019] In this embodiment, the search string is input, the search results are displayed, and the field restriction is then added to narrow the results further. Alternatively, the search may be performed after the search string and the field restriction have been input in succession. In that case the processing on the server side is almost unchanged, and only the display order and the like on the client side need to be changed.

[0020]

[Effects of the Invention] As is clear from the above description, according to the document retrieval device and the document retrieval method of the present invention, one or more search fields are set in an area of each document that is not used for normal document processing, a fuzzy search that includes incomplete matches of a specific character string is made possible for all documents, and the search result can be further narrowed down by conditions set on the search fields. Diverse searches including fuzzy search can therefore be performed at high speed on a plurality of documents stored in the storage device without adopting a database structure, which improves both the handling of a large number of stored documents and operability for the person searching.

[Brief description of drawings]

[FIG. 1] A block diagram showing the system configuration of a server-client workstation system to which the present invention is applied.

[FIG. 2] An explanatory diagram showing the structure of a document stored in the storage device.

[FIG. 3] A block diagram showing part of the functional configuration of the server and a client machine in a server-client workstation system to which the present invention is applied.

[FIG. 4] An explanatory diagram showing a search screen on the display of a client machine.

[FIG. 5] An explanatory diagram showing a search screen on the display of a client machine.

[Explanation of symbols]

1 Server
2 Storage device
3 Network
4 Client machine
11 Search unit
12 Aggregation processing unit

Claims

[Claims]

1. A document retrieval device for retrieving a specific document from a plurality of documents stored in a storage device, comprising: one or more search fields set for each document; means for performing, on all the documents, a fuzzy search that includes incomplete matches of a specific character string; and means for performing, on the documents extracted by the fuzzy search, a conditional search using the search fields.

2. A document retrieval method for retrieving a specific document from a plurality of documents stored in a storage device, comprising: a step of performing, on all the documents, a fuzzy search that includes incomplete matches of a specific character string; and a step of performing, on the documents extracted by the fuzzy search, a conditional search using one or more search fields set in advance for each document.