JP5802924B2

JP5802924B2 - Document search system and document search program

Info

Publication number: JP5802924B2
Application number: JP2011167158A
Authority: JP
Inventors: 顕足立
Original assignee: アーカイブ技術研究所株式会社
Priority date: 2011-07-29
Filing date: 2011-07-29
Publication date: 2015-11-04
Anticipated expiration: 2031-07-29
Also published as: JP2013030089A

Description

The present invention relates to a document search system and a document search program that search a document database and display the search results.

Search systems have long been used to extract necessary information from vast amounts of information. In a typical search system, documents that contain many occurrences of the input search terms are displayed at the top, and other factors are also taken into account in the display (see, for example, paragraph 0029 of Patent Document 1).

Patent Document 1: JP 2009-187211 A

However, with the above display method, when the input search terms include both general-purpose and non-general-purpose terms, documents containing many occurrences of the general-purpose terms are displayed at the top, and the documents the user truly needs are pushed down. In addition, when multiple search terms are used, a document that merely covers many topics exhaustively, and contains no passage in which the search terms are related to one another, may be ranked highly. As a result, to find the document that is truly needed, the user has to display and browse the search results one after another, which takes a great deal of time and effort.

Admittedly, a user skilled at searching may be able to avoid general-purpose terms. However, even a skilled user will find it difficult to know, before searching, which terms are general-purpose terms in the database being searched. There is therefore a need for a search system that displays the documents the user truly needs at the top even when the user enters multiple search terms that include general-purpose terms.

The present invention has been made in view of these problems, and its object is to provide a document search system and a document search program that can reduce the influence of general-purpose terms and display at the top documents containing passages in which the search terms are related to one another.

The document search system according to the present invention uses an index in which the appearance frequency of each word, obtained by dividing each document in the search target document group in units of n characters (n ≥ 1), is registered for each format division of each document. The system includes the following three components.
(A1) A dividing unit that analyzes a given search condition and divides each search term included in the search condition in units of n characters.
(A2) An extraction unit that uses the index to extract the appearance frequency of each word obtained by dividing each search term, for each format division registered in the index.
(A3) A weighting unit that, using the per-word appearance frequencies extracted by the extraction unit and without using information on the position of each search term within the format divisions of a document, calculates the appearance frequency of each search term for each format division and the generality of each search term, and uses the calculated appearance frequencies and generalities to calculate the weight of each document.

The document search program according to the present invention uses an index in which the appearance frequency of each word, obtained by dividing each document in the search target document group in units of n characters (n ≥ 1), is registered for each format division of each document. The program causes a computer to execute the following three steps.
(B1) A first step of analyzing a given search condition and dividing each search term included in the search condition in units of n characters.
(B2) A second step of using the index to extract the appearance frequency of each word obtained by dividing each search term, for each format division registered in the index.
(B3) A third step of, using the per-word appearance frequencies extracted by the extraction unit and without using information on the position of each search term within the format divisions of a document, calculating the appearance frequency of each search term for each format division and the generality of each search term, and using the calculated appearance frequencies and generalities to calculate the weight of each document.

In the document search system and document search program according to the present invention, the generality of each search term is calculated using the above index. Because the generality of each search term is thus derived at search time, results can be ranked with generality taken into account. Moreover, because the generality is derived by calculation, the user need not be concerned with whether a word entered as a search condition is a general-purpose term. Furthermore, in the present invention, the index is used to calculate the appearance frequency of each search term for each format division, which is smaller than a document. This prevents a document that merely covers many topics exhaustively, and in which the search terms are not related to one another, from being ranked highly.

In the document search system and document search program according to the present invention, the format division is, for example, a page, a paragraph, a chapter, or a section. The document search system according to the present invention may further include a merge unit that determines the ranking of each document using the weights obtained by the weighting unit. In addition to the merge unit, the document search system may further include a search result display unit that displays each document according to the ranking determined by the merge unit. Here, the search result display unit may display the layout of a plurality of consecutive format divisions that include the format division with the highest appearance frequency in each document. Alternatively, the search result display unit may display the layout of the format division with the highest appearance frequency in each document.

In the document search system and document search program according to the present invention, the n-character unit may comprise a plurality of character units, and the index may include an index for each of the character units included in the n-character unit. In this case, the dividing unit may divide each search term by each character unit. The extraction unit may then use the index to extract the appearance frequency of each word obtained by the dividing unit, for each format division registered in the index and for each character unit. The weighting unit may then use the per-word appearance frequencies obtained by the extraction unit to calculate the appearance frequency of each search term for each format division and each character unit, as well as the generality of each search term, and use the calculated appearance frequencies and generalities to calculate the weight of each document.

According to the document search system and document search program of the present invention, an index in which the appearance frequency of each n-character word obtained from the search target document group is registered for each format division of each document is used to calculate the generality of each search term and the appearance frequency of each search term for each format division, which is smaller than a document. This reduces the influence of general-purpose terms and allows documents containing passages in which the search terms are related to one another to be displayed at the top.

FIG. 1 is a functional block diagram of a document search system according to an embodiment of the present invention.
FIG. 2 is a functional block diagram of the index registration unit of FIG. 1.
FIG. 3 is a diagram showing an example of an index structure.
FIG. 4 is a diagram showing an example of an index.
FIG. 5 is a functional block diagram of the search unit of FIG. 1.
FIG. 6 is a diagram showing an example of a calculation in the search unit of FIG. 5.
FIG. 7 is a diagram showing an example of a search result display.
FIG. 8 is a diagram showing another example of a search result display.
FIG. 9 is a diagram showing yet another example of a search result display.
FIG. 10 is a functional block diagram of a modification of the document search system of FIG. 1.
FIG. 11 is a diagram showing an example of the index structure of FIG. 10.
FIG. 12 is a functional block diagram of the search unit of FIG. 10.
FIG. 13 is a configuration diagram of a first application example of the document search system of FIG. 1.
FIG. 14 is a configuration diagram of a second application example of the document search system of FIG. 1.
FIG. 15 is a configuration diagram of a third application example of the document search system of FIG. 1.
FIG. 16 is a configuration diagram of a fourth application example of the document search system of FIG. 1.

Hereinafter, embodiments for carrying out the invention are described in detail with reference to the drawings. The description is given in the following order.

1. Embodiment
An example in which a single index is used
2. Modification
An example in which a plurality of indexes are used
3. Application examples

<1. Embodiment>
[Configuration]
FIG. 1 shows an example of the functional blocks of a document search system 100 according to an embodiment of the present invention. As shown in FIG. 1, the document search system 100 includes, for example, a document storage unit 110, an index registration unit 120, an index 130, a search condition input unit 140, a search unit 150, a merge unit 160, and a search result display unit 170.

The document storage unit 110 stores the document group to be searched. It is composed of, for example, a storage device accessed via a network, such as NAS (Network Attached Storage), or a hard disk accessed via a bus. The "document group to be searched" here refers to the documents registered in the index 130 (that is, documents whose addresses are known). The document group is a collection of documents created with various editors. The documents are, for example, office documents, academic papers, and periodicals.

FIG. 2 shows an example of the functional blocks of the index registration unit 120. The index registration unit 120 creates and registers an index of the document group in the document storage unit 110. It may be implemented in hardware (an application circuit) or as a processing device loaded with a program (software).

The index registration unit 120 first acquires and creates a list of documents (steps S101 and S102). Specifically, for each document in the document storage unit 110, it acquires information such as the file name, address, date, and file size, and compiles this information into a list. At this point, the index registration unit 120 may assign an identifier to each acquired document. The identifier may be anything that is unique to each document, such as a number or a symbol.

Next, the index registration unit 120 takes an address from the created list and acquires the corresponding document from the document storage unit 110 (step S103). If it holds a previously created list, it computes the difference between the previous list and the current list and acquires a document from the document storage unit 110 only when that document is detected as new or updated. Whether a document has been updated can be determined from, for example, its date or file size. Furthermore, when the difference shows that a previously known document no longer exists, the index registration unit 120 deletes that document from the current list.
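The list comparison in step S103 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation described in the patent: the `DocEntry` record, the `diff_listings` function, and the choice of the address as dictionary key are hypothetical, and the update test simply follows the text's remark that updates can be judged from the date and file size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocEntry:
    """One row of the document list (hypothetical field names)."""
    address: str
    date: str
    size: int


def diff_listings(previous, current):
    """Compare two {address: DocEntry} listings.

    Returns (to_fetch, removed): addresses that are new or updated,
    and addresses that were known before but are absent now.
    """
    to_fetch = [
        addr for addr, entry in current.items()
        if addr not in previous  # new document
        or (entry.date, entry.size) != (previous[addr].date, previous[addr].size)  # updated
    ]
    removed = [addr for addr in previous if addr not in current]
    return to_fetch, removed


previous = {
    "a.doc": DocEntry("a.doc", "2011-07-01", 100),
    "b.doc": DocEntry("b.doc", "2011-07-01", 200),
    "d.doc": DocEntry("d.doc", "2011-07-01", 400),
}
current = {
    "a.doc": DocEntry("a.doc", "2011-07-01", 100),  # unchanged, not fetched again
    "b.doc": DocEntry("b.doc", "2011-07-15", 250),  # updated
    "c.doc": DocEntry("c.doc", "2011-07-20", 300),  # new
}
to_fetch, removed = diff_listings(previous, current)
```

Only the documents in `to_fetch` need to be read from storage again, which is the point of keeping the earlier list.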

Next, the index registration unit 120 performs page division on each acquired document (step S104). Specifically, it assigns an identifier to each page of each acquired document. The identifier may be anything that is unique to each page within a document: a plain page number, or some other number or symbol.

The significance of page division is as follows. In general, when a search is performed on a per-file basis, documents with a large file size and documents covering a wide range of information are favored. However, such documents are not always the ones the user truly needs. In particular, when multiple search terms are used, such a document is likely to be one in which the search terms are not related to one another. A document in which the search terms are unrelated is not what the user truly needs and should not be displayed at the top. When a search is performed on a per-page basis, by contrast, file size and breadth of coverage no longer affect how favorably a document is ranked. Moreover, when multiple search terms are used and all of them occur within a single page, it is very likely that the search terms are related to one another on that page, even though the positions of the terms within the document are not known. Searching on a per-page basis therefore yields results equivalent to those of a search that takes the positions of the search terms within the document into account.

To take the positions of the search terms within a document into account, each document to be searched in the document storage unit 110 would have to be searched for each search term with a grep-type search (a full scan of the text). A grep-type search places a very heavy load on processing, which makes fast searching difficult. A per-page search, by contrast, needs no information on the positions of the search terms within a document in the first place, and therefore no grep-type search at search time, so fast searching is possible.

Next, the index registration unit 120 performs n-character division (n ≥ 1) on each page of each acquired document (step S105). Specifically, it cuts the text of each page into pieces of n characters. When the text has m characters, it is divided into (m - (n - 1)) words. For example, with n = 2, the 12-character sentence 「キーワードが入力される。」 ("A keyword is input.") is divided into the 11 words 「キー」, 「ーワ」, 「ワー」, 「ード」, 「ドが」, 「が入」, 「入力」, 「力さ」, 「され」, 「れる」, and 「る。」.
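The n-character division of step S105 can be sketched as a sliding window that advances one character at a time. The function name `split_ngrams` is an assumption made for this example; the sentence is the one used in the text.

```python
def split_ngrams(text: str, n: int = 2) -> list[str]:
    """Slide a window of n characters over text, one character at a time.

    A text of m characters yields m - (n - 1) words (none if m < n).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "キーワードが入力される。"
words = split_ngrams(sentence, n=2)
```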

The significance of n-character division is as follows. In general, there are two ways to create an index: creating an index for search terms prepared in advance, and creating an index for the words obtained by dividing text in units of n characters (n-gram). The present embodiment uses the latter, n-gram, method. In the former method, search terms must be prepared in advance, which takes effort. With n-gram, words are extracted automatically from the document group to be searched, so no words need to be prepared in advance. Applying n-gram thus greatly reduces the effort required for searching.

Next, the index registration unit 120 creates split indexes (step S106). Specifically, it registers the words obtained by dividing the text in a split index for each page of each document; when the same word is obtained again, its occurrence count is incremented. The split index therefore holds, for each page of each document, pairs of a word and its occurrence count.
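A split index of this kind can be sketched with `collections.Counter`, which increments the count whenever a word repeats. The function name `build_split_index`, the dictionary layout, and the sample page texts are assumptions made for this example.

```python
from collections import Counter


def build_split_index(pages: dict[int, str], n: int = 2) -> dict[int, Counter]:
    """Map each page number of one document to a Counter of its n-character words."""
    split_index = {}
    for page_no, text in pages.items():
        words = (text[i:i + n] for i in range(len(text) - n + 1))
        split_index[page_no] = Counter(words)  # a repeated word increments its count
    return split_index


# Hypothetical two-page document.
pages = {1: "キーワードのキー", 2: "ケンサク"}
split_index = build_split_index(pages)
```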

Next, the index registration unit 120 merges and registers the indexes (steps S107 and S108). Specifically, as shown for example in FIG. 3, it converts the split indexes into the structure that is ultimately used for searching (index structure 121). The index structure 121 associates the appearance frequency of each word obtained by dividing the text with each page of each document. It combines, for example, a word obtained by dividing the text (word 121A), the identifier of a document containing that word (file number 121B), the identifier of a page containing that word (page number 121C), and the number of occurrences of the word within that page (appearance frequency 121D). As shown for example in FIG. 4, the index registration unit 120 creates an index structure 121 for each word obtained by dividing the text and for each page of each document, and registers it in the index 130.
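One way to picture the merge is as a word-keyed table whose entries carry the file number, page number, and appearance frequency. The source specifies only these four fields; the `Posting` tuple, the `merge_split_indexes` function, and the dictionary-of-lists layout below are assumptions for illustration. The sample counts are the ones quoted for FIG. 6 later in the text.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Posting(NamedTuple):
    file_no: int  # file number 121B
    page_no: int  # page number 121C
    freq: int     # appearance frequency 121D


def merge_split_indexes(split_indexes: dict[tuple[int, int], Counter]) -> dict[str, list[Posting]]:
    """Turn {(file_no, page_no): Counter} into {word: [Posting, ...]}."""
    index = defaultdict(list)
    for (file_no, page_no), counts in sorted(split_indexes.items()):
        for word, freq in counts.items():
            index[word].append(Posting(file_no, page_no, freq))
    return dict(index)


split_indexes = {
    (5, 1): Counter({"キー": 10, "ーワ": 5}),
    (8, 6): Counter({"キー": 4}),
}
index = merge_split_indexes(split_indexes)
```

Keying the merged table by word means a search only has to look up the words of the query, rather than scan every page.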

The search condition input unit 140 accepts a search condition entered by the user. It may be, for example, a data input device such as a keyboard, mouse, touch panel, or microphone, or a communication device that receives a search condition entered by the user via a network.

FIG. 5 shows an example of the functional blocks of the search unit 150. Based on the index 130, the search unit 150 extracts documents that match the search condition received from the search condition input unit 140 from the search target document group in the document storage unit 110. It may be implemented in hardware (an application circuit) or as a processing device loaded with a program (software).

The search unit 150 first analyzes the search condition given by the search condition input unit 140 and extracts the search terms (keywords) included in it (step S201). A search condition may contain only one search term, but it generally contains several. The following description assumes that the search condition contains multiple search terms. For example, when the input search condition is 「キーワードケンサク」, the search unit 150 extracts the two terms 「キーワード」 and 「ケンサク」 as search terms, as shown in FIG. 6.

Next, the search unit 150 performs n-character division on each acquired search term (step S202). Specifically, it cuts each search term into pieces of n characters. The number of characters n is the same as that used in the n-character division performed when the index 130 was created. For example, as shown in FIG. 6, 「キーワード」 has m = 5 characters; with n = 2, the number of cuts is N = m - (n - 1) = 5 - (2 - 1) = 4, so the search unit 150 divides 「キーワード」 into the four words 「キー」, 「ーワ」, 「ワー」, and 「ード」. Likewise, 「ケンサク」 has m = 4 characters; with n = 2, N = m - (n - 1) = 4 - (2 - 1) = 3, so the search unit 150 divides 「ケンサク」 into the three words 「ケン」, 「ンサ」, and 「サク」.
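The query-side division can be checked with a few lines. The names `split_search_term` and `INDEX_N` are assumptions for this sketch; the only requirement taken from the text is that the query uses the same n as the index.

```python
def split_search_term(term: str, n: int) -> list[str]:
    """Cut a search term of m characters into N = m - (n - 1) words of n characters."""
    return [term[i:i + n] for i in range(len(term) - (n - 1))]


INDEX_N = 2  # must equal the n used when the index was built
query_words = {
    term: split_search_term(term, INDEX_N)
    for term in ["キーワード", "ケンサク"]
}
```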

Next, the search unit 150 uses the index 130 to calculate the appearance frequency of each search term for each document registered in the index 130 (step S203). Specifically, it first uses the index 130 to extract the appearance frequency of each word obtained by dividing each search term, for each page of each document registered in the index 130. For example, as shown in FIG. 6, for 「キー」 the search unit 150 obtains an appearance frequency of 10 on page 1 of file number 5 and an appearance frequency of 4 on page 6 of file number 8.
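The extraction step amounts to a lookup per word, grouped by page. The sketch below assumes a hypothetical in-memory index that maps each word to a list of (file number, page number, frequency) tuples; the function name `lookup_frequencies` is likewise an assumption.

```python
def lookup_frequencies(index, words):
    """Return {(file_no, page_no): {word: freq}} for the given words."""
    per_page = {}
    for word in words:
        for file_no, page_no, freq in index.get(word, []):
            per_page.setdefault((file_no, page_no), {})[word] = freq
    return per_page


# Hypothetical index fragment: word -> list of (file number, page number, frequency).
index = {
    "キー": [(5, 1, 10), (8, 6, 4)],
    "ーワ": [(5, 1, 5)],
}
per_page = lookup_frequencies(index, ["キー", "ーワ", "ワー"])
```

A word that is absent from the index, such as 「ワー」 in this fragment, simply contributes nothing to any page.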

Next, the search unit 150 uses the per-word appearance frequencies extracted for each page of each document to calculate (estimate) the appearance frequency of each search term. For example, as shown in FIG. 6, on page 1 of file number 5 the appearance frequency of 「キー」 is 10, that of 「ーワ」 is 5, that of 「ワー」 is 8, and that of 「ード」 is 2, so the search unit 150 takes the minimum of these, 2, as the appearance frequency of 「キーワード」 in file number 5 (or on page 1 of file number 5), f5(キーワード). In the same way, it takes 1 as the appearance frequency of 「キーワード」 in file number 8, f8(キーワード). Similarly, it takes 13 as the appearance frequency of 「ケンサク」 in file number 5, f5(ケンサク), and 16 as the appearance frequency of 「ケンサク」 in file number 8, f8(ケンサク).
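The minimum-based estimate can be sketched as below. The function name is an assumption, and treating a word that is missing from the page as frequency 0 is also an assumption of this sketch. The sample frequencies are those given for page 1 of file number 5.

```python
def estimate_term_frequency(term_words, page_freqs):
    """Estimate a search term's frequency on one page as the minimum
    frequency among its n-character words (0 if any word is absent)."""
    return min((page_freqs.get(word, 0) for word in term_words), default=0)


page_freqs = {"キー": 10, "ーワ": 5, "ワー": 8, "ード": 2}  # file number 5, page 1
f5_keyword = estimate_term_frequency(["キー", "ーワ", "ワー", "ード"], page_freqs)
```

The minimum is an upper bound rather than an exact count: the full term cannot occur more often than its rarest n-character word, but the words may also occur separately, which is why the text calls the result an estimate.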

Next, the search unit 150 uses the index 130 to calculate the generality of each search term (step S204). Here, generality means the degree to which a term is distributed across the search target document group in the document storage unit 110. It is a concept corresponding to the number (the so-called hit count) or proportion of documents in the search target document group that contain the search term. A term with high generality is widely distributed across the search target document group and is not sufficiently effective for narrowing down a search.

The significance of calculating the generality of each search term is as follows. A typical search system applies "keyword order", in which documents containing many occurrences of the input search terms are displayed at the top. In such a system, however, when the input search terms include both general-purpose and non-general-purpose terms, documents containing many occurrences of the general-purpose terms are displayed at the top and documents containing the non-general-purpose terms are displayed lower. The documents the user truly needs are presumably those containing many occurrences of the non-general-purpose terms, yet in such a system they are crowded out by the general-purpose terms and buried in the lower ranks. Taking the generality of the search terms into account when displaying results is therefore particularly important for displaying at the top the documents the user truly wants.

The generality may be, for example, the number of hits obtained by searching the search target document group in the document storage unit 110 for documents containing the search term, or the number of documents in which the appearance frequency of the search term is 1 or more. When the latter is used, the search unit 150 can count the documents whose appearance frequency is 1 or more while calculating the appearance frequency of each search term, and use the resulting count as the generality. In other words, in that case the search unit 150 does not need to search the entire search target document group in the document storage unit 110 just to calculate the generality of each search term.
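Under the second definition, generality is a simple count over frequencies that have already been computed. The function name `generality` and the sample frequencies are assumptions made for this sketch.

```python
def generality(term_freq_by_doc: dict[int, int]) -> int:
    """Count the documents in which the term's estimated frequency is at least 1."""
    return sum(1 for freq in term_freq_by_doc.values() if freq >= 1)


# Hypothetical per-document frequencies for one search term: file number -> frequency.
freq_by_doc = {5: 2, 8: 1, 9: 0}
v = generality(freq_by_doc)
```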

Next, the search unit 150 weights the documents (step S205). Specifically, the search unit 150 first lists the documents that contain each search term, for example the documents (or their identifiers) in which the appearance frequency of the search term is 1 or more. The search unit 150 then uses the appearance frequency and generality of each search term to calculate, for each listed document, a weight for each search term.

Let fd(key) be the appearance frequency of each search term, V(key) the generality of each search term, and M the number of documents in the search-target document group in the document storage unit 110. The search unit 150 obtains the per-term weight of each listed document using, for example, fd(key) × M / V(key), as shown in FIG. 6. The search unit 150 further obtains the weight Cost(d) of each listed document using, for example, Σ(fd(key) × M / V(key)), as shown in FIG. 6, where d is the file number. In the example of FIG. 6, the search unit 150 uses this formula to obtain 27.6 as the weight Cost(5) of file number 5 and 25.2 as the weight Cost(8) of file number 8. That is, in the example of FIG. 6, for the search terms "キーワード" (keyword) and "ケンサク" (search), the weight Cost(5) of file number 5 is larger than the weight Cost(8) of file number 8.
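As a worked illustration of the formula Cost(d) = Σ(fd(key) × M / V(key)), the following sketch computes the weight for two hypothetical documents. The numbers are invented for the example and are not the values of FIG. 6; the function and variable names are likewise assumptions.

```python
def term_weight(fd, m, v):
    """Weight of one search term in one document: fd(key) x M / V(key)."""
    return fd * m / v


def cost(doc_freqs, generality, m):
    """Cost(d): sum of the per-term weights over the search terms.

    doc_freqs  -- {term: fd(key) in document d}
    generality -- {term: V(key)}
    m          -- number of documents in the search-target group (M)
    """
    return sum(
        term_weight(fd, m, generality[term])
        for term, fd in doc_freqs.items()
        if fd > 0
    )


# Invented numbers: 100 documents, one rare term found in 5 documents
# and one general term found in 50 documents.
M = 100
V = {"rare": 5, "general": 50}
doc_a = {"rare": 2, "general": 3}    # few hits, but on the rare term
doc_b = {"rare": 0, "general": 20}   # many hits, all on the general term
```

Here `cost(doc_a, V, M)` is 2 × 100 / 5 + 3 × 100 / 50 = 46, while `cost(doc_b, V, M)` is 20 × 100 / 50 = 40, so the document with fewer total hits ranks higher because its hits are on the rarer term.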

As shown in FIG. 6, the appearance frequency of "ケンサク" is a full order of magnitude higher than that of "キーワード". Under a typical "ranking order", therefore, the evaluation value of file number 8 would be larger than that of file number 5. In that case, the document with a high frequency of "ケンサク" (file number 8) would be displayed higher and the document with a high frequency of "キーワード" (file number 5) lower. In the example of FIG. 6, by contrast, the weight Cost(5) of file number 5 is larger than the weight Cost(8) of file number 8, so the document with a high frequency of "キーワード" (file number 5) is displayed higher and the document with a high frequency of "ケンサク" (file number 8) lower. Using the generality thus makes it possible to display at the top documents that would otherwise be crowded out by general terms and buried in the lower ranks.

The merge unit 160 uses the Cost(d) values obtained by the search unit 150 to determine the ranking of the documents. The merge unit 160 may be implemented in hardware (an application-specific circuit) or as a processing device loaded with a program (software). The merge unit 160 sorts the documents in descending order of Cost(d). In doing so, it collects predetermined information about each sorted document, the number of hits, and so on. For example, the merge unit 160 acquires, for each sorted document, the file name, the creation date, layout information (for example, image data) for multiple pages including a page that contains the search term, layout information (for example, image data) for the top page (first page), and the number of hits. Here, the merge unit 160 preferably acquires layout information (for example, image data) for multiple pages including the page on which the appearance frequency of the search term is highest.
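The sorting step, and the preferred choice of the page with the highest frequency, might look like the sketch below. The two Cost values are the ones quoted above for FIG. 6; the function names and the tie-breaking rule (lowest page number first) are assumptions, since the text does not say how ties are resolved.

```python
def rank_documents(costs):
    """Return file numbers sorted by Cost(d), largest first."""
    return sorted(costs, key=lambda file_no: costs[file_no], reverse=True)


def page_to_show(page_freqs):
    """Pick the page with the highest search-term frequency.

    page_freqs -- {page_number: frequency}; ties go to the lowest page
    number, which is an assumption made for this sketch.
    """
    return min(page_freqs, key=lambda page: (-page_freqs[page], page))


costs = {8: 25.2, 5: 27.6}  # Cost(8) and Cost(5) as quoted for FIG. 6
```

Calling `rank_documents(costs)` places file number 5 ahead of file number 8, matching the ordering described for FIG. 6.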

As necessary, the merge unit 160 stores the sort information (information on the order of the documents) and the collected information (file names and so on) in a predetermined storage area. The predetermined storage area is the area that the search result display unit 170 accesses when it displays the search results on the display. The merge unit 160 may instead pass the sort information and the collected information directly to the search result display unit 170 as necessary.

The search result display unit 170 displays each document on the screen according to the ranking determined by the merge unit 160. The search result display unit 170 may be implemented in hardware (an application-specific circuit) or as a processing device loaded with a program (software); it is, for example, a web browser. The search result display unit 170 first acquires, for example, the information (sort information and so on) stored in the predetermined storage area. Instead of reading that information from the predetermined storage area, it may acquire the sort information and so on directly from the merge unit 160. The search result display unit 170 then displays each document on the screen based on the acquired information.

FIGS. 7, 8, and 9 show examples of the layout of the search results that the search result display unit 170 displays on the screen. For example, a search window 171 and a search button 172 are arranged at the top of the screen, and buttons for selecting the display form (file 173, page 174, thumbnail 175) are arranged on the left side of the screen. In the center of the screen, the file name, the creation date, and layout information (for example, image data) for one or more pages are arranged according to the sort information.

As shown in FIG. 7, for example, the search result display unit 170 arranges, for each document, layout information (for example, image data) for multiple pages including a page that contains the search term in a single horizontal row. Because the page layouts are displayed along with the file name and creation date, the user can look for a document while viewing the page layouts. Moreover, because the layouts of multiple pages are displayed on the screen at once, the user can take them in at a glance and use non-text information near the text containing the search term (for example, figures, tables, formulas, and photographs) as a clue for finding the desired document. For example, a user who wants to find "a document with a figure somewhere around page 3" can enter as search terms the words expected to appear near that figure and thereby find the desired document.

As shown in FIG. 8, for example, the search result display unit 170 may instead display, for each document, layout information (for example, image data) for one page that contains the search term. In this case, the search result display unit 170 preferably displays, for each document, the layout information (for example, image data) of the page on which the appearance frequency of the search term is highest. Even when only one page layout is displayed per document, the user can look for a document while viewing the page layouts. Although fewer pages are visible at once than in the case above, the user can still use non-text information near the text containing the search term (for example, figures, tables, formulas, and photographs) as a clue for finding the desired document.

As shown in FIG. 9, for example, the search result display unit 170 may also display, for each document containing the search term, the layout information (for example, image data) of its top page, one page per document. In this case, the user can take in the top-page layouts of a very large number of documents at once and look for a document while viewing many top pages.

When the search result display unit 170 displays the layouts of pages that contain the search term and pages that do not on the screen at the same time, it may make the two kinds of pages visually distinguishable. For example, as shown in FIG. 7, the search result display unit 170 may highlight the edge 176 of each page that contains the search term. The search result display unit 170 may also allow documents displayed as search results to be selectively extracted and saved separately. For example, as shown in FIGS. 7, 8, and 9, an extraction icon 177 may be displayed beside each document displayed as a search result, and when the user selects an icon 177, the document corresponding to that icon 177 may be saved separately.

[Effects]
Next, the effects of the document search system 100 of the present embodiment are described.

In the present embodiment, the generality of each search term is calculated using the index 130, in which the appearance frequency of each n-character word obtained from the search-target document group is registered for each page. Because the generality of each search term is derived in this way, a ranking display that takes generality into account becomes possible. Because the generality is derived by calculation, the user no longer needs to worry about whether a word entered as a search condition is a general term. In addition, in the present embodiment, the index 130 is used to calculate the appearance frequency of each search term for each page, a format break that is smaller than the document. This prevents documents that merely cover many topics, without the search terms being related to one another, from being ranked high. The influence of general terms is thereby reduced, and documents containing text in which the search terms are related to one another can be displayed at the top.

In the present embodiment, because the search is performed page by page, the file size and the breadth of the content no longer affect a document's ranking. Furthermore, when multiple search terms are used and all of them occur within a single page, it is very likely that the search terms are related to one another on that page, even without knowing their positions in the document. Searching page by page therefore yields results equivalent to a search that takes the positions of the search terms in the document into account. Moreover, a page-by-page search does not need position information for the search terms in the document in the first place, so no grep-type search has to be executed at search time. High-speed search is therefore possible.
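The page-level co-occurrence check described above needs only the per-page index entries, not character positions. The following is a rough sketch under that reading; the index layout (word, file number, page number, frequency), the sample entries, and the function name are assumptions for illustration.

```python
def pages_with_all_terms(index, terms):
    """Return sorted (file_number, page_number) pairs containing every term.

    index -- iterable of (word, file_number, page_number, frequency)
    terms -- search words, assumed to be already split to index words
    """
    common = None
    for term in terms:
        hits = {
            (file_no, page_no)
            for word, file_no, page_no, freq in index
            if word == term and freq >= 1
        }
        # Intersect page sets; no position data or grep-style scan needed.
        common = hits if common is None else common & hits
    return sorted(common or set())


# Hypothetical entries: file 1 has the two words on different pages,
# file 2 has both on page 1.
SAMPLE_INDEX = [
    ("ab", 1, 1, 3),
    ("cd", 1, 2, 1),
    ("ab", 2, 1, 2),
    ("cd", 2, 1, 4),
]
```

With these entries, only page 1 of file 2 contains both "ab" and "cd", so only that page is returned; file 1 mentions both words but never on the same page.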

In the present embodiment, in the example of FIG. 7, layout information (for example, image data) for multiple pages including a page that contains the search term is arranged in a single horizontal row for each document. This allows the user to look for a document while viewing the page layouts. Moreover, because the layouts of multiple pages are displayed at once, the user can take them in at a glance and use non-text information near the text containing the search term (for example, figures, tables, formulas, and photographs) as a clue for finding the desired document.

<2. Modifications>
[First Modification]
In the above embodiment, the unit of division in the n-character division (the n-character unit) is two characters, but it may be one character, or three or more characters. If the n-character unit is too large, however, it may equal or exceed the number of characters in a search term, so the n-character unit is preferably equal to or smaller than the statistical average number of characters in a search term. For example, the statistical average character count for Japanese is 2.3 characters, so when Japanese is used for the search terms, the n-character unit is preferably two or three characters. If higher search accuracy for Japanese is desired, the n-character units preferably include one character in addition to the two and three characters that are close to the statistical average for Japanese. Similarly, the statistical average character count for English is five characters, so when English is used for the search terms, the n-character unit is preferably five characters.
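The effect of the choice of n can be seen in a small sketch of n-character division. This passage does not say whether the units overlap; the sketch assumes an overlapping sliding window, as is common for n-gram indexing, and it returns a term shorter than n unsplit. Both choices, and the function name, are assumptions.

```python
def split_n_chars(text, n):
    """Split text into overlapping n-character words (assumed scheme)."""
    if n < 1:
        raise ValueError("n must be 1 or more")
    if not text:
        return []
    if len(text) <= n:
        # The unit is as long as or longer than the term: nothing to split.
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For the five-character term "キーワード", n = 2 yields four words ("キー", "ーワ", "ワー", "ード"), whereas n = 5 yields the term itself, which illustrates why a unit close to or below the typical term length is preferred.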

[Second Modification]
In the above embodiment and the first modification, a single unit of division (n-character unit) is used in the n-character division, but multiple units may be used. FIG. 10 shows an example of the document search system 100 in which there are multiple n-character units. For example, as shown in FIG. 10, there may be three n-character units: a k1-character unit (k1 ≧ 1), a k2-character unit (k2 > k1), and a k3-character unit (k3 > k2).

In this case, the index registration unit 120 needs to perform the procedure from page division S104 to index merging S107 for each type of character unit. In the example of FIG. 11, the index registration unit 120 performs the procedure from page division S104 to index merging S107 for each of the k1-character, k2-character, and k3-character units. The index registration unit 120 also needs to register an index for each type of character unit. In the example of FIG. 11, the index registration unit 120 registers an index for each of the k1-character, k2-character, and k3-character units. An index 130 therefore exists for each character unit; in the example of FIG. 11, there is one index 130 for each of the k1-character, k2-character, and k3-character units.
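Keeping one index per character unit could be sketched as follows. The page texts, the overlapping division, and the dictionary layout keyed by (word, file number, page number) are assumptions for the example; steps S104 to S107 themselves are not reproduced here.

```python
from collections import Counter


def build_indexes(pages, units):
    """Build one index per character unit.

    pages -- {(file_number, page_number): page text}
    units -- iterable of character-unit sizes, e.g. (k1, k2, k3)
    Returns {n: {(word, file_number, page_number): frequency}}.
    """
    indexes = {}
    for n in units:
        counts = Counter()
        for (file_no, page_no), text in pages.items():
            # Overlapping n-character division (assumed scheme).
            for i in range(len(text) - n + 1):
                counts[(text[i:i + n], file_no, page_no)] += 1
        indexes[n] = dict(counts)
    return indexes
```

For instance, `build_indexes({(1, 1): "abab"}, (1, 2))` yields a 1-character index with "a" and "b" each counted twice and a 2-character index with "ab" counted twice and "ba" once. The storage cost grows with the number of units, which is the trade-off for the added flexibility.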

Furthermore, the search unit 150 needs to perform the procedure from n-character division S202 to document weighting S205 for each type of character unit. In the example of FIG. 12, the search unit 150 performs the procedure from n-character division S202 to document weighting S205 for each of the k1-character, k2-character, and k3-character units. The merge unit 160 then selects the most appropriate weighting from among the document weightings obtained for the respective character units.

The n-character units preferably include units close to the statistical average character count for the languages of the search terms that may be entered. For example, when Japanese and English are used for the search terms, the n-character units preferably include the two and three characters that are close to the statistical average for Japanese and the five characters that are close to the statistical average for English. If higher search accuracy for Japanese is desired, the n-character units preferably include one, two, three, and five characters.

[Third Modification]
In the above embodiment and its modifications, the page is used as the unit of format break for managing documents, indexes, and so on, but the present invention is not limited to this; the unit may be, for example, a paragraph, chapter, or section. Pages, paragraphs, chapters, and sections are regions in which specific content is grouped together, and they serve as formal document structure markers. Dividing a document by page, paragraph, chapter, or section therefore makes it possible to divide the document by semantic content.

When paragraphs, chapters, or sections are used as the format breaks of a document, "page" in the above embodiment and its modifications should be read as "paragraph, chapter, or section". For example, the index registration unit 120 may divide each acquired document into paragraphs, chapters, or sections and perform the n-character division for each paragraph, chapter, or section of each acquired document (steps S104 and S105). The index registration unit 120 may also register the words obtained by dividing the text in the divided index for each paragraph, chapter, or section of each document. In that case, the divided index holds, for each paragraph, chapter, or section of each document, pairs of a word and its number of occurrences.
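A paragraph-based variant of the divided index might be built as in the sketch below. Treating blank lines as paragraph boundaries, the overlapping n-character division, and the function name are all assumptions; how the actual system detects paragraphs, chapters, or sections is not described in this passage.

```python
import re
from collections import Counter


def index_by_paragraph(file_no, text, n=2):
    """Register (word, file_number, paragraph_number) -> occurrences.

    Paragraphs are assumed to be separated by blank lines.
    """
    entries = {}
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    for para_no, para in enumerate(paragraphs, start=1):
        # Count each n-character word within this paragraph only.
        counts = Counter(para[i:i + n] for i in range(len(para) - n + 1))
        for word, freq in counts.items():
            entries[(word, file_no, para_no)] = freq
    return entries
```

For the text "abab", a blank line, then "cd", this registers "ab" (twice) and "ba" (once) under paragraph 1 and "cd" (once) under paragraph 2, so each word is paired with its count per paragraph rather than per page.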

[Fourth Modification]
The above embodiment and its modifications illustrate the case in which the index registration unit 120 and the search unit 150 are implemented as a processing device loaded with a program. In that case, the document search system 100 has a mechanism for loading the program into the processing device. For example, the document search system 100 may include a reader that reads the program from a readable recording medium on which a program describing the operations executed by the index registration unit 120 and the search unit 150 is recorded. Alternatively, the document search system 100 may include a communication system that acquires the program via a network.

<3. Application Examples>
Application examples of the document search system 100 described in the above embodiment and its modifications are described below. The document search system 100 can be applied to a standalone document search device 200 as shown in FIG. 13. As shown in FIG. 14, it can also be applied to a system in which search conditions are entered into the document search device 200 from a terminal device 300 via an external network 400. As shown in FIG. 15, it can also be applied to a system in which a document group in a document storage device 500 connected to the external network 400 is searched using a document search device 200 connected to the external network 400. As shown in FIG. 16, it can also be applied to a system in which a document group in a document storage device 500 connected to a LAN 600 is searched using a document search device 200 connected to the LAN 600.

The document search device 200 shown in FIG. 13 implements the functions of the document search system 100 in a single terminal device. As shown in FIG. 13, the document search device 200 includes, for example, a control unit 210 that controls the entire document search device 200, a storage unit 220 that can store data used by the control unit 210, an input unit 230 that accepts the input of search conditions, and a display unit 240 that displays the search results. The control unit 210, the storage unit 220, the input unit 230, and the display unit 240 are connected to, for example, a common bus 250. As shown in FIG. 13, the storage unit 220 stores, for example, a document search program 221, a document storage unit 222, and an index 223.

The document search program 221 causes a computer to execute the series of procedures performed by the index registration unit 120, the search unit 150, the merge unit 160, and the search result display unit 170. The document storage unit 222 corresponds to one form of the document storage unit 110. The index 223 corresponds to one form of the index 130. The control unit 210 loaded with the document search program 221 corresponds to one form of the index registration unit 120, the search unit 150, the merge unit 160, and the search result display unit 170.

In the search system shown in FIG. 14, the terminal device 300 and the document search device 200 are connected via the external network 400. The document search device 200 of FIG. 14 corresponds to the document search device 200 of FIG. 13 with the display unit 240 omitted and a communication unit 260 provided in place of the input unit 230. The communication unit 260 enables the document search device 200 to communicate with the terminal device 300 via the external network 400.

The terminal device 300 accepts the input of search conditions, passes the accepted search conditions to the document search device 200 via the external network 400, and presents the search results to the user. As shown in FIG. 14, the terminal device 300 includes, for example, a control unit 310 that controls the entire terminal device 300, a storage unit 320 that can store data used by the control unit 310, an input unit 330 that accepts the input of search conditions, a display unit 340 that displays the search results, and a communication unit 350 that communicates with the document search device 200 via the external network 400. The control unit 310, the storage unit 320, the input unit 330, the display unit 340, and the communication unit 350 are connected to, for example, a common bus 360. Although not shown, the storage unit 320 stores, for example, software (for example, a web browser) that causes a computer to execute the series of procedures performed by the search result display unit 170. The document search program 221 of FIG. 14 causes a computer to execute the series of procedures performed by the index registration unit 120, the search unit 150, and the merge unit 160 described above. The document storage unit 222 corresponds to one form of the document storage unit 110. The index 223 corresponds to one form of the index 130. The control unit 210 loaded with the document search program 221 corresponds to one form of the index registration unit 120, the search unit 150, and the merge unit 160 described above. The control unit 310 loaded with the software that causes a computer to execute the series of procedures performed by the search result display unit 170 corresponds to one form of the search result display unit 170.

In the search system shown in FIG. 15, the document search device 200 and the document storage device 500 are connected via the external network 400. The document search device 200 of FIG. 15 corresponds to the document search device 200 of FIG. 13 with the document storage unit 222 omitted and a communication unit 260 added. The communication unit 260 enables the document search device 200 to communicate with the document storage device 500 via the external network 400.

The document storage device 500 is a storage device that is accessed via a network. As shown in FIG. 15, the document storage device 500 includes, for example, a control unit 510 that controls the entire document storage device 500, a storage unit 520 that can store data used by the control unit 510, and a communication unit 530 that communicates with the document search device 200 via the external network 400.

In the search system shown in FIG. 16, the document search device 200 and the document storage device 500 are connected via the LAN 600. The document search device 200 of FIG. 16 corresponds to the document search device 200 of FIG. 15 with a communication unit 270 provided in place of the communication unit 260. The communication unit 270 enables the document search device 200 to communicate with the document storage device 500 via the LAN 600.

The document storage device 500 in FIG. 16 corresponds to the document storage device 500 in FIG. 15 with a communication unit 550 provided in place of the communication unit 530. The communication unit 550 is a device that enables the document storage device 500 to communicate with the document search device 200 via the LAN 600.

As described above, the document search system 100 can be applied to search systems of various configurations.
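The configurations of FIGS. 13, 15, and 16 differ mainly in where the documents are held and which communication unit reaches them. As a rough illustration only, the Python sketch below models that separation. The names `DocumentStore`, `LocalStore`, `RemoteStore`, `DocumentSearchDevice`, and the `transport` callable are hypothetical and do not appear in the specification, and the network is simulated by an in-memory lookup rather than real communication.

```python
from typing import Callable, Dict, Protocol


class DocumentStore(Protocol):
    """Hypothetical interface: anything that can return a document body."""

    def fetch(self, doc_id: str) -> str: ...


class LocalStore:
    """Stands in for a document storage unit held inside the search device (as in FIG. 13)."""

    def __init__(self, docs: Dict[str, str]) -> None:
        self._docs = dict(docs)

    def fetch(self, doc_id: str) -> str:
        return self._docs[doc_id]


class RemoteStore:
    """Stands in for a separate document storage device reached over a network or LAN.

    `transport` is an assumed callable that plays the role of the
    communication units; here it is just a function call.
    """

    def __init__(self, transport: Callable[[str], str]) -> None:
        self._transport = transport

    def fetch(self, doc_id: str) -> str:
        return self._transport(doc_id)


class DocumentSearchDevice:
    """The search side only depends on the DocumentStore interface."""

    def __init__(self, store: DocumentStore) -> None:
        self._store = store

    def open(self, doc_id: str) -> str:
        return self._store.fetch(doc_id)


# Simulated remote side: an in-memory dict behind a function call.
remote_docs = {"d1": "text held by the storage device"}
device = DocumentSearchDevice(RemoteStore(remote_docs.__getitem__))
print(device.open("d1"))
```

Under these assumptions, moving from the external network 400 to the LAN 600 would change only the transport passed to `RemoteStore`, not the search logic.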

DESCRIPTION OF SYMBOLS

100 ... document search system, 110 ... document storage unit, 120 ... index registration unit, 121 ... index structure, 121A ... word, 121B ... file number, 121C ... page number, 121D ... appearance frequency, 130 ... index, 140 ... search condition input unit, 150 ... search unit, 160 ... merge unit, 170 ... search result display unit, 171 ... search window, 172 ... search button, 173 ... file, 174 ... page, 175 ... thumbnail, 176 ... edge, 177 ... icon, 200 ... document search device, 210, 310, 510 ... control unit, 220, 320, 520 ... storage unit, 221 ... document search program, 222 ... document storage unit, 223 ... index, 230, 330 ... input unit, 240, 340 ... display unit, 250, 360, 540 ... bus, 260, 270, 350, 530, 550 ... communication unit, 300 ... terminal device, 500 ... document storage device, 600 ... LAN.

Claims

A document search system comprising:
a dividing unit that analyzes a given search condition and divides each search word included in the search condition in units of n characters (n ≧ 1);
an extraction unit that, using an index in which the appearance frequency of each word obtained by dividing each document in a search target document group in units of n characters is registered for each format break of the document, extracts the appearance frequency of each word obtained by the division in the dividing unit, for each format break registered in the index; and
a weighting unit that, while using the appearance frequency of each word extracted by the extraction unit and without using position information of each search word within the format breaks constituted in the document, calculates a generality of each search word and calculates a weight of each document using the appearance frequency and the generality obtained by the calculation.

The document search system according to claim 1, wherein the format break is a page, a paragraph, a chapter, or a section.

The document search system according to claim 1, further comprising a merge unit that determines a ranking of each document using the weight obtained by the weighting unit.

The document search system according to claim 3, further comprising a search result display unit that displays each document according to the ranking determined by the merge unit.

The document search system according to claim 4, wherein the search result display unit displays a layout of a plurality of consecutive format breaks including the format break having the highest appearance frequency in each document.

The document search system according to claim 4, wherein the search result display unit displays a layout of the format break having the highest appearance frequency in each document.

The document search system according to any one of claims 1 to 6, wherein
the n-character unit includes a plurality of character units,
the index includes an index for each character unit included in the n-character unit,
the dividing unit divides each search word in each character unit,
the extraction unit uses the index to extract the appearance frequency of each word obtained by the division in the dividing unit, for each format break registered in the index and for each character unit, and
the weighting unit uses the appearance frequency of each word extracted by the extraction unit to calculate the appearance frequency for each format break and each character unit and the generality of each search word, and calculates the weight of each document using the appearance frequency and the generality obtained by the calculation.

A document search program causing a computer to execute:
a first step of analyzing a given search condition and dividing each search word included in the search condition in units of n characters (n ≧ 1);
a second step of, using an index in which the appearance frequency of each word obtained by dividing each document in a search target document group in units of n characters is registered for each format break of the document, extracting the appearance frequency of each word obtained by the division in the first step, for each format break registered in the index; and
a third step of, while using the appearance frequency of each word extracted in the second step and without using position information of each search word within the format breaks constituted in the document, calculating a generality of each search word and calculating a weight of each document using the appearance frequency and the generality obtained by the calculation.
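The three steps recited above can be illustrated with a minimal Python sketch. It is not the claimed implementation. The claims do not state a formula for generality or weight, so the sketch assumes that generality is the fraction of format breaks (here, pages) on which a word appears, and that each occurrence contributes `frequency / generality` to its document. Overlapping n-character units and all function names (`ngrams`, `build_index`, `document_weights`) are likewise assumptions made for illustration.

```python
from collections import defaultdict


def ngrams(text, n):
    """Split text into overlapping n-character units (overlap is an assumption)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs, n):
    """index[word][(doc_id, page_no)] = appearance frequency on that page.

    Only counts per format break are stored; no in-page positions are kept.
    """
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, pages in docs.items():
        for page_no, page in enumerate(pages):
            for word in ngrams(page, n):
                index[word][(doc_id, page_no)] += 1
    return index


def document_weights(query, index, docs, n):
    """First to third steps: divide, extract per-page frequency, weight."""
    total_pages = sum(len(pages) for pages in docs.values())
    weights = defaultdict(float)
    for term in query.split():
        for word in ngrams(term, n):           # first step: divide
            postings = index.get(word, {})     # second step: extract
            if not postings:
                continue
            # Assumed generality: share of pages containing the word.
            generality = len(postings) / total_pages
            for (doc_id, _page_no), freq in postings.items():
                # Assumed weighting: general words count for less.
                weights[doc_id] += freq / generality
    return dict(weights)


# Each document is a list of pages (the format break used in this sketch).
docs = {"a": ["search index", "the index"], "b": ["the the the"]}
index = build_index(docs, n=2)
weights = document_weights("the index", index, docs, n=2)
ranking = sorted(weights, key=weights.get, reverse=True)
print(ranking)
```

Sorting the weights, as in the last lines, corresponds loosely to the ranking that claim 3 assigns to the merge unit.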