JP2006323575A

JP2006323575A - Document retrieval system, document retrieval method, document retrieval program and recording medium

Info

Publication number: JP2006323575A
Application number: JP2005145378A
Authority: JP
Inventors: Yoshitaka Hamaguchi; 佳孝濱口
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2005-05-18
Filing date: 2005-05-18
Publication date: 2006-11-30

Abstract

PROBLEM TO BE SOLVED: To improve document retrieval processing if a retrieval condition includes at least numerical information.

SOLUTION: The document retrieval system comprises a numerical information classification means classifying one or more pieces of numerical information extracted from an input document for each unit kind of numerical information; a characteristic quantity calculation means calculating a characteristic quantity for estimating a document including numerical information in a predetermined numerical range for each unit kind; a characteristic quantity storage means storing the characteristic quantity calculated by the calculation means for each unit kind; a numerical importance calculation means acquiring, based on a unit kind of input numerical information inputted as a retrieval condition, the characteristic quantity of the corresponding unit kind from the characteristic quantity storage means and calculating, based on at least the acquired characteristic quantity, the weight of the input numerical information inputted as the retrieval condition; and a document retrieval means retrieving, based on the weight of input numerical information calculated by the numerical importance calculation means, a document from registered documents.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a document search system, a document search method, a document search program, and a recording medium. In particular, it can be applied to a document search system, document search method, document search program, and recording medium that enable document search when an input keyword or sentence includes a numerical value as a search condition.

Conventionally, a document search system that searches documents based on one or more keywords calculates, as shown in Patent Document 1, the degree of fitness with the keywords from the term frequency tf (Term Frequency) and the document frequency df (Document Frequency), and outputs search results starting from the documents with the highest fitness.

Here, the term frequency tf is the frequency with which a keyword appears in a given document, and the document frequency df is the number of documents in which the keyword appears.

Here, df is used to measure how much a keyword contributes to narrowing down documents. A keyword with a small df, which appears in only a few documents, contributes readily to narrowing down, whereas a keyword with a large df (in the extreme case, one that appears in almost every document) is of little use for narrowing down a search. The fitness formula is therefore generally set so that words with a smaller df have a larger influence on the fitness calculation.

Meanwhile, tf is used to reflect that a document containing many occurrences of a keyword matches the search condition better: the larger tf is, the higher the document's fitness.

As described above, for each document, the product of the df-based weight (larger for a smaller df) and tf is summed over all keywords to calculate the fitness of that document with respect to the keywords, and the document search is performed.
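As a minimal sketch of the conventional scoring described above, the following assumes that the df-based weight is log(N/df), where N is the number of registered documents. That particular formula, the function name `fitness`, and the sample documents are illustrative assumptions and are not taken from Patent Document 1.

```python
import math
from collections import Counter

def fitness(doc_tokens, keywords, df, n_docs):
    """Sum over keywords of tf * log(n_docs / df) (assumed weight formula)."""
    tf = Counter(doc_tokens)
    score = 0.0
    for kw in keywords:
        if df.get(kw, 0) > 0:
            score += tf[kw] * math.log(n_docs / df[kw])
    return score

# Hypothetical registered documents, already split into words.
docs = {
    "d1": ["network", "speed", "provider", "speed"],
    "d2": ["network", "provider"],
    "d3": ["network", "cable"],
}

# df: number of documents in which each word appears.
df = Counter()
for tokens in docs.values():
    df.update(set(tokens))

query = ["speed", "network"]
ranked = sorted(docs, key=lambda d: fitness(docs[d], query, df, len(docs)),
                reverse=True)
```

In this sketch "network" appears in every document, so its weight is zero and only the rarer word "speed" affects the ranking.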

Patent Document 1: JP 2000-322416 A

However, when the conventional document search system described above is used to search documents with numerical values also included as search conditions, the conventional method cannot efficiently use a quantity corresponding to df, and the document search processing becomes slow. This problem is explained below.

For word keywords, when documents are entered into the document DB (database), the number of documents in which each word appears, that is, df, can be counted for each appearing word and stored in advance in association with the word. At search time it is then enough to look up that value for each keyword. In other words, there is no need to search for and count the documents containing a keyword at search time, so processing speed improves. This is common practice in conventional search.
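The precomputation described above can be sketched as follows. The class name `KeywordDfIndex` and its methods are hypothetical; the point is only that df is updated when a document is registered and merely looked up at search time.

```python
from collections import Counter

class KeywordDfIndex:
    """Keeps df per word, updated whenever a document is registered."""

    def __init__(self):
        self.df = Counter()
        self.n_docs = 0

    def register(self, tokens):
        self.n_docs += 1
        # Count each distinct word once per document.
        self.df.update(set(tokens))

    def lookup(self, word):
        # Search-time cost is a single dictionary lookup.
        return self.df[word]

index = KeywordDfIndex()
index.register(["provider", "speed", "speed"])
index.register(["provider", "cable"])
```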

For numerical values, however, the number of documents narrowed down differs depending on the value or range specified in the condition. For example, fewer documents match when “10 m” is specified than when “1 m to 100 m” is specified, so df changes with the value or range used as the search condition. df information reflecting this is therefore required, but the variations of such numerical search conditions are innumerable, and df cannot be counted and stored in advance for every possible numerical condition. The speed-up available for keywords therefore cannot be used.

As a result, df is obtained only after the search is actually executed at search time, the numerical conditions are compared, and the documents falling within the range of the condition are counted. Compared with obtaining df for a conventional keyword, this makes search-time processing very heavy, and the problem becomes more pronounced as the number of documents to be searched grows.
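To make the cost concrete, the following sketch computes df for a numerical condition by scanning every document at search time. The function `numeric_df_by_scan` and the sample values (taken to be lengths in metres) are assumptions for illustration only.

```python
def numeric_df_by_scan(doc_values, low, high):
    """Count documents holding at least one value in [low, high].

    Every registered document is visited, so the cost grows with the
    size of the collection.
    """
    return sum(
        1 for values in doc_values.values()
        if any(low <= v <= high for v in values)
    )

# Hypothetical numerical values (metres) found in each document.
doc_values = {"d1": [10.0], "d2": [5.0, 50.0], "d3": [300.0], "d4": []}

df_point = numeric_df_by_scan(doc_values, 10.0, 10.0)   # condition "10 m"
df_range = numeric_df_by_scan(doc_values, 1.0, 100.0)   # condition "1 m to 100 m"
```

The narrower condition yields the smaller df, as in the “10 m” versus “1 m to 100 m” example, but each query requires a full pass over the collection.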

To solve this problem, the document search system of the first aspect of the present invention is a document search system that uses at least numerical information as a search condition and searches registered documents for documents matching the search condition, comprising: (1) numerical information classification means for classifying each of one or more pieces of numerical information extracted from an input document by unit type of the numerical information; (2) feature amount calculation means for calculating, together with a predetermined numerical range for each unit type, a feature amount for estimating the documents that contain numerical information; (3) feature amount storage means for storing, for each unit type, the feature amount calculated by the feature amount calculation means; (4) numerical importance calculation means for acquiring, from the feature amount storage means, the feature amount of the unit type corresponding to the unit type of input numerical information entered as a search condition, and for calculating the weight of that input numerical information based on at least the acquired feature amount; and (5) document search means for searching the registered documents for documents based on the weight of the input numerical information calculated by the numerical importance calculation means.

The document search method of the second aspect of the present invention is a document search method that uses at least numerical information as a search condition and searches registered documents for documents matching the search condition, wherein: (1) a numerical information classification step classifies each of one or more pieces of numerical information extracted from an input document by unit type of the numerical information; (2) a feature amount calculation step calculates, together with a predetermined numerical range for each unit type, a feature amount for estimating the documents that contain numerical information; (3) feature amount storage means is provided that stores, for each unit type, the feature amount calculated in the feature amount calculation step; (4) a numerical importance calculation step acquires, from the feature amount storage means, the feature amount of the unit type corresponding to the unit type of input numerical information entered as a search condition, and calculates the weight of that input numerical information based on at least the acquired feature amount; and (5) a document search step searches the registered documents for documents based on the weight of the input numerical information calculated in the numerical importance calculation step.

Further, the document search program of the third aspect of the present invention is a document search program that uses at least numerical information as a search condition and searches registered documents for documents matching the search condition, the program causing a computer to function as: (1) numerical information classification means for classifying each of one or more pieces of numerical information extracted from an input document by unit type of the numerical information; (2) feature amount calculation means for calculating, together with a predetermined numerical range for each unit type, a feature amount for estimating the documents that contain numerical information; (3) feature amount storage means for storing, for each unit type, the feature amount calculated by the feature amount calculation means; (4) numerical importance calculation means for acquiring, from the feature amount storage means, the feature amount of the unit type corresponding to the unit type of input numerical information entered as a search condition, and for calculating the weight of that input numerical information based on at least the acquired feature amount; and (5) document search means for searching the registered documents for documents based on the weight of the input numerical information calculated by the numerical importance calculation means.

The recording medium of the fourth aspect of the present invention is a computer-readable recording medium on which is recorded a document search program that uses at least numerical information as a search condition and searches registered documents for documents matching the search condition, wherein the recorded document search program corresponds to the document search program of the third aspect of the present invention.

According to the present invention, a feature amount for documents containing a numerical value and/or a numerical range is calculated for each unit type of numerical information when a document is registered, so information about the documents containing given numerical information is obtained in advance at registration time. Even when a numerical value and/or a numerical range is used as a search condition, the search can therefore be performed without deriving that information during search processing. This shortens the processing time required for a search.

Embodiments of the document search system, document search method, document search program, and recording medium of the present invention are described below with reference to the drawings.

The document search system in the following embodiments can be realized as a computer-executable program. This program may be stored on a computer-readable recording medium or on the hard disk of the apparatus.

(A) First Embodiment
First, the first embodiment of the document search system, document search method, document search program, and recording medium of the present invention is described.

This embodiment describes the case where the document search system, document search method, and document search program of the present invention are applied to a document search system that takes an input keyword, numerical value, and/or numerical range as search conditions and outputs, from among a plurality of registered documents, the documents with high evaluation values as the retrieved documents.

This embodiment reflects the numerical value given as a search condition by approximating its df with the number of appearing documents (df) held for each predetermined numerical interval.

(A-1) Configuration of the First Embodiment
FIG. 1 is a functional block diagram illustrating the functions of the document search system according to this embodiment.

As shown in FIG. 1, the document search system 1 of this embodiment includes at least a document input unit 3000 that enters information on input documents into a database, a document storage unit 2000 that stores the information on input documents as a database, and a document search unit 1000 that searches the document storage unit 2000 for documents matching the input conditions (for example, keywords).

First, the configuration of the document storage unit 2000 is described. The document storage unit 2000 includes at least a DF data storage unit 2200, a numerical range distribution data storage unit 2300, and a document DB (database) unit 2400.

The DF data storage unit 2200 stores the results counted by the document number counting unit 3200 described later. For each word serving as a headword, it stores df data indicating the number of input documents that contain that word.

The numerical range distribution data storage unit 2300 stores the results calculated by the numerical range distribution statistical unit 3400 described later. It stores the number of documents containing a numerical value in each predetermined numerical interval set in advance for each unit type.

Here, the predetermined numerical intervals for each unit type are the intervals obtained by dividing the numerical range into a finite number of parts, according to a predetermined rule, for each unit type that a numerical value may carry. In this embodiment, “bps” is used as the example unit type, and each interval for this unit covers one digit of the numerical value (1 or more and less than 10, 10 or more and less than 100, and so on).

The predetermined numerical intervals for each unit type need not be based on digits. For example, for numerical values spanning a wide range, the logarithm of the value may be taken and divided into intervals of a fixed width. In this way, even numerical values spanning a wide range can be divided into a finite number of intervals. Different intervals can also be set for each unit type.
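The two interval rules described above can be sketched as follows. The function names, the interval width of 0.5, and the per-unit table `INTERVAL_RULES` are assumptions introduced for illustration; the embodiment only requires that each unit type has some finite set of intervals.

```python
import math

def decade_interval(value):
    """Index k such that 10**k <= value < 10**(k + 1), for value > 0."""
    if value <= 0:
        raise ValueError("value must be positive")
    k = math.floor(math.log10(value))
    # Guard against floating-point error near powers of ten.
    if 10 ** (k + 1) <= value:
        k += 1
    elif value < 10 ** k:
        k -= 1
    return k

def log_interval(value, width=0.5):
    """Index of the fixed-width interval on the log10 scale."""
    if value <= 0:
        raise ValueError("value must be positive")
    return math.floor(math.log10(value) / width)

# Hypothetical per-unit configuration: each unit type may use its own rule.
INTERVAL_RULES = {
    "bps": decade_interval,
    "m": lambda v: log_interval(v, width=0.5),
}
```

With the digit-based rule, 9.99 falls in interval 0 (1 or more and less than 10) and 10 falls in interval 1 (10 or more and less than 100).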

For each document (or each piece of document identification information identifying a document) entered through the document input unit 3000, the document DB unit 2400 stores the headwords that appear in the document together with their occurrence counts (tf), and the numerical information that appears in the document together with its occurrence counts. Here, numerical information means information that has at least a numerical value and a unit, for example the information that the numerical value extraction unit 3300 described later extracts from a character string in the document containing a numerical value and a unit. In the following, a character string consisting of a numerical value and a unit extracted from a document and converted into a predetermined numerical representation is also referred to as numerical information. The representation of numerical information is not particularly limited, and different representations may be used depending on the unit type and the like. The document DB unit 2400 may also store the total number of occurrences, across all registered documents, of each headword and each piece of numerical information.

Next, the configuration of the document input unit 3000 is described. The document input unit 3000 includes at least a numerical value extraction unit 3300, a headword extraction unit 3100, a document number counting unit 3200, and a numerical range distribution statistical unit 3400.

The headword extraction unit 3100 searches the documents to be searched for words that serve as headwords, extracts them, and stores the extracted headwords and their occurrence counts (tf) in the document DB unit 2400 in association with the document or document identification information. Various methods can be used by the headword extraction unit 3100 to extract headwords. One example is a method based on morphological analysis, which uses a word dictionary storing the words that serve as headwords together with part-of-speech rules, looks up the headwords contained in the input document by referring to the dictionary, and extracts the words found.
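The dictionary lookup can be illustrated with the rough sketch below. It is not a morphological analyzer: it simply matches dictionary entries against English text on word boundaries, which stands in for the analysis step. The dictionary contents and the function `extract_headwords` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical headword dictionary.
HEADWORD_DICTIONARY = {"communication speed", "provider"}

def extract_headwords(text, dictionary=HEADWORD_DICTIONARY):
    """Return {headword: tf} for dictionary entries found in the text."""
    lowered = text.lower()
    counts = Counter()
    for word in dictionary:
        # Whole-word match; a real system would use morphological analysis.
        hits = re.findall(r"\b" + re.escape(word) + r"\b", lowered)
        if hits:
            counts[word] = len(hits)
    return dict(counts)

headwords = extract_headwords(
    "The provider advertises a communication speed "
    "that another provider cannot match."
)
```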

The document number counting unit 3200 counts, for each word that the headword extraction unit 3100 obtained as a headword, the number of documents in which that word appears, and stores the per-word document counts in the DF data storage unit 2200.

The numerical value extraction unit 3300 searches the documents to be searched for character strings representing numerical values, extracts them, interprets each extracted character string as a predetermined numerical representation, and converts it into numerical information for each unit type according to that interpretation. The numerical value extraction unit 3300 passes the numerical information interpreted in the predetermined numerical representation to the document DB unit 2400 and the numerical range distribution statistical unit 3400. Various methods can be used by the numerical value extraction unit 3300 to extract character strings representing numerical values. For example, numerical values may be located in the document and the character strings containing them subjected to morphological analysis with reference to a word dictionary storing preset words (items) related to numerical values, thereby extracting the value and unit type. Alternatively, the value and unit type may be extracted from the document according to preset rules based on character types and parts of speech.
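A rule-based extraction of the second kind can be sketched with a regular expression. The unit table `UNITS`, its multipliers, and the choice of bps as the base unit are assumptions made for this example; the embodiment does not prescribe a particular numerical representation.

```python
import re

# Hypothetical unit table: surface form -> (unit type, multiplier to base unit).
UNITS = {
    "bps": ("bps", 1.0),
    "kbps": ("bps", 1e3),
    "Mbps": ("bps", 1e6),
    "Gbps": ("bps", 1e9),
}

# Longest unit names first so that "Mbps" is not read as "bps".
_UNIT_ALTERNATION = "|".join(
    sorted((re.escape(u) for u in UNITS), key=len, reverse=True)
)
_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(" + _UNIT_ALTERNATION + r")\b")

def extract_numeric_info(text):
    """Return a list of (unit_type, value_in_base_unit) pairs."""
    results = []
    for number, unit in _PATTERN.findall(text):
        unit_type, factor = UNITS[unit]
        results.append((unit_type, float(number) * factor))
    return results

found = extract_numeric_info(
    "Up to 50 Mbps downstream and 1Gbps on the backbone."
)
```

Normalizing every value to a single base unit per unit type lets values written as “50 Mbps” and “50000 kbps” fall into the same numerical interval.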

On receiving numerical information from the numerical value extraction unit 3300, the numerical range distribution statistical unit 3400 classifies it according to the predetermined numerical intervals for its unit type. Numerical values are thereby classified into a finite set of preset intervals for each unit type.

For example, if the numerical information is “5×10^7 bps” (where “^” denotes exponentiation), its digit count, taken here as the exponent of ten, is 7, so it is classified into the “7-digit interval”. Similarly, if the numerical information is “1×10^9”, its digit count is 9, so it is classified into the “9-digit interval”.

After classifying the numerical information, the numerical range distribution statistical unit 3400 counts the number of appearing documents (df) for the numerical interval to which the information belongs and updates the per-interval df. This makes it possible to keep track of the number of documents that contain numerical information belonging to each interval. The per-interval document counts obtained by the numerical range distribution statistical unit 3400 are passed to the numerical range distribution data storage unit 2300 and stored there.

For example, in the case above, if the number of appearing documents for the “7-digit interval” is 71, classifying the current numerical information “5×10^7 bps” raises it to 72. When a piece of numerical information covers a range spanning several digits (that is, several intervals), it can be handled by counting it in each of those intervals.
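The per-interval counting, including a range that spans several intervals, can be sketched as below. The class `IntervalDfTable` is hypothetical. It keeps a set of document identifiers per interval, an implementation choice assumed here so that a document with two values in the same interval is counted once; the embodiment itself only requires a count per interval.

```python
import math
from collections import defaultdict

class IntervalDfTable:
    """Per unit type and digit interval, the documents that contain it."""

    def __init__(self):
        # (unit_type, interval index) -> set of document ids
        self._docs = defaultdict(set)

    @staticmethod
    def interval(value):
        # Digit interval k with 10**k <= value < 10**(k + 1), value > 0.
        return math.floor(math.log10(value))

    def add_value(self, doc_id, unit_type, value):
        self._docs[(unit_type, self.interval(value))].add(doc_id)

    def add_range(self, doc_id, unit_type, low, high):
        # A range spanning several intervals is counted in each of them.
        for k in range(self.interval(low), self.interval(high) + 1):
            self._docs[(unit_type, k)].add(doc_id)

    def df(self, unit_type, k):
        return len(self._docs.get((unit_type, k), ()))

table = IntervalDfTable()
table.add_value("d1", "bps", 5e7)
table.add_value("d1", "bps", 6e7)        # same document, same interval
table.add_value("d2", "bps", 1e9)
table.add_range("d3", "bps", 5e6, 2e8)   # spans intervals 6, 7 and 8
```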

Although the numerical range distribution statistical unit 3400 has been described as counting and storing the number of appearing documents for each numerical interval of each unit type, it may instead count and store the total number of occurrences (CF) of each numerical interval across all registered documents. By calculating and storing the number of occurrences of each numerical interval across all registered documents in this way, that occurrence count can be used as the estimate when the interval is used as a search condition.

Next, the configuration of the document search unit 1000 is described. The document search unit 1000 includes at least an input unit 1100, a word importance calculation unit 1200, a numerical importance calculation unit 1300, a document evaluation value calculation unit 1400, and an output unit 1500.

The input unit 1100 takes in the document search conditions, such as keywords, numerical values, and numerical ranges, and passes them to the word importance calculation unit 1200 and the numerical importance calculation unit 1300. The information that the input unit 1100 can take in includes word keywords, numerical values, numerical ranges, and the like, and each of these, or any combination of them, is used as the search condition.

When given an input keyword by the input unit 1100, the word importance calculation unit 1200 obtains the df data for the keyword from the DF data storage unit 2200 and calculates the importance of the word as a search key. The word importance can be calculated by a common evaluation method, for example by retrieving the total number of documents registered in the document DB unit 2400 and using the idf obtained as log(<total number of registered documents>/df). The word importance calculation unit 1200 passes the calculated importance of the keyword as a search condition to the document evaluation value calculation unit 1400.

When given an input numerical value by the input unit 1100, the numerical importance calculation unit 1300 calculates the importance of the search condition based on that value by referring to the data in the numerical range distribution data storage unit 2300 that corresponds to the search condition. For example, if the numerical range distribution data storage unit 2300 stores the number of appearing documents for each unit type and digit, the number of appearing documents for the unit type and digit of the search condition is treated as the counterpart of df for a keyword, and the importance is calculated from it. As with the word importance calculation unit 1200, various calculation methods can be used, for example a commonly known technique such as idf. The numerical importance calculation unit 1300 passes the calculated importance of the input numerical value as a search condition to the document evaluation value calculation unit 1400.
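As a minimal sketch of this calculation, the following treats the stored per-interval document count as df and applies the same log(N/df) form used for keywords. The function `numeric_importance`, the sample counts, and the decision to return 0 when no document falls in the interval are assumptions for illustration.

```python
import math

def numeric_importance(interval_df, unit_type, value, n_docs):
    """idf-style weight of a numerical condition from its digit interval."""
    k = math.floor(math.log10(value))        # digit interval of the condition
    df = interval_df.get((unit_type, k), 0)  # stored count, no scan needed
    if df == 0:
        return 0.0                           # assumed handling of empty intervals
    return math.log(n_docs / df)

# Hypothetical stored counts: (unit type, digit interval) -> number of documents.
interval_df = {("bps", 7): 72, ("bps", 9): 3}

w_common = numeric_importance(interval_df, "bps", 5e7, 1000)
w_rare = numeric_importance(interval_df, "bps", 1e9, 1000)
```

Because the count is looked up rather than computed by scanning documents, the search-time cost matches that of a keyword, at the price of df being an approximation at the granularity of the interval.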

For each document in the document DB unit 2400 that satisfies a keyword and/or numerical search condition, the document evaluation value calculation unit 1400 obtains the number of occurrences of the keyword and/or the number of pieces of numerical information matching the numerical search condition, and uses these as tf. Based on the tf of each document and the importance of each search condition obtained from the word importance calculation unit 1200 and/or the numerical importance calculation unit 1300, it then calculates the evaluation value of each document in the document DB unit 2400. The evaluation value of each document can be calculated with a common search technique, for example by taking the sum of tf×idf over the search conditions. The document evaluation value calculation unit 1400 passes the calculated evaluation value of each document to the output unit 1500.
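The evaluation can be sketched as a sum of tf × importance over keyword and numerical conditions alike. The condition keys, weights, and per-document tf values below are made-up illustrations, and `evaluation_value` is a hypothetical name.

```python
def evaluation_value(doc_tf, importances):
    """Sum of tf * importance over all search conditions."""
    return sum(doc_tf.get(cond, 0) * weight
               for cond, weight in importances.items())

# Hypothetical importances of one keyword and one numerical condition.
importances = {
    ("word", "provider"): 0.5,
    ("num", "bps", 7): 2.0,
}

# Hypothetical per-document tf for each condition.
doc_tfs = {
    "d1": {("word", "provider"): 2},
    "d2": {("word", "provider"): 1, ("num", "bps", 7): 1},
    "d3": {},
}

scores = {d: evaluation_value(tf, importances) for d, tf in doc_tfs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here d2 ranks first because it matches the numerical condition, which carries the larger weight.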

The output unit 1500 determines the documents to output as search results based on the evaluation value of each document obtained by the document evaluation value calculation unit 1400, and outputs them. If the document DB unit 2400 stores document identification information rather than the documents themselves, the output unit 1500 may either output the document identification information or acquire and output the corresponding documents based on it.

(A-2) Operation of the First Embodiment
Next, processing in the document search system 1 of this embodiment will be described with reference to the drawings.

First, the operation performed when a document is input to the document input unit 3000 will be described with reference to the flowchart of FIG. 2. The following description uses as an example the case where document 11 shown in FIG. 3 is input (FIG. 3 shows part of the document).

FIG. 4 shows an example of the contents stored in the DF data storage unit 2200, FIG. 5 shows an example of the contents stored in the numerical range distribution data storage unit 2300, and FIG. 6 shows an example of the contents stored in the document DB unit 2400.

In FIG. 2, when an input document is taken into the document input unit 3000 (S1), the headword extraction unit 3100 performs morphological analysis on the document and, from the result, extracts words that serve as headwords, such as nouns (S2).

For example, when document 11 is taken in, the words registered in advance as headwords, namely "communication speed" (W11), "provider" (W13), and "provider" (W15), are extracted from the result of the morphological analysis by the headword extraction unit 3100. Headwords can be extracted by referring to a word dictionary or the like prepared in advance.

When the headword extraction unit 3100 has extracted the headwords, each extracted word is associated, together with its number of occurrences, with the document or its document identification information and stored in the document DB unit 2400 (S3).

For example, in document 11 the headword "communication speed" occurs once and the headword "provider" occurs twice; these counts are associated with the document or its document identification information and stored in the document DB unit 2400 (see FIG. 6).

When the headword extraction unit 3100 has extracted headwords from the document, the extracted headwords are supplied to the document number counting unit 3200, which counts the df of each extracted headword (S4). The df counted by the document number counting unit 3200 is stored in the DF data storage unit 2200 (S5).

For example, when the headword extraction unit 3100 extracts the word "communication speed" (W11) from document 11 as a headword, that word is supplied to the document number counting unit 3200. The document number counting unit 3200 then increments the df of "communication speed" by one (for example, from 119 before document 11 was input to 120) and stores it in the DF data storage unit 2200 (see FIG. 4).

Also, when the input document is taken in at S1, the numerical value extraction unit 3300 extracts from the document numerical information consisting of a numerical value and a unit (S6).

For example, when document 11 is input, the numerical value extraction unit 3300 extracts "50 Mbps" (W12). The process by which the numerical value extraction unit 3300 extracts character strings consisting of a numerical value and a unit has been described above, so a detailed description is omitted here.

When the numerical value extraction unit 3300 has extracted numerical information as a character string consisting of a numerical value and a unit, it converts the numerical expression into a predetermined format so that the numerical information can be assigned to one of the preset numerical intervals (S7).

For example, in this embodiment the numerical intervals are divided according to the "digit" (order of magnitude) of the value, so the numerical value extraction unit 3300 interprets "50 Mbps" extracted from document 11 as the numerical expression "5 × 10^7 bps" and converts it into numerical information consisting of mantissa "5", exponent "7", and unit "bps".
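The conversion in S7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `parse_numeric_info`, the regular expression, and the prefix table are introduced here for illustration, and only ASCII input such as `50Mbps` is handled.

```python
import re
from decimal import Decimal

# Hypothetical SI prefix table; the source text only uses "M" and "G".
SI_PREFIX_EXPONENT = {"": 0, "k": 3, "M": 6, "G": 9}

_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([kMG]?)([A-Za-z]+)\s*$")


def parse_numeric_info(text):
    """Split a string like '50Mbps' into (mantissa, exponent, unit)."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a numeric expression: {text!r}")
    number, prefix, unit = match.groups()
    # Decimal keeps powers of ten exact, so the exponent is never off by one.
    value = Decimal(number) * Decimal(10) ** SI_PREFIX_EXPONENT[prefix]
    if value == 0:
        return Decimal(0), 0, unit
    exponent = value.adjusted()          # position of the leading digit
    mantissa = value.scaleb(-exponent)   # value / 10**exponent
    return mantissa, exponent, unit


print(parse_numeric_info("50Mbps"))
```

Using `Decimal` rather than floating-point logarithms is a design choice of this sketch: it avoids rounding problems at exact powers of ten such as 1 Gbps.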

Once the numerical value extraction unit 3300 has interpreted the numerical information, the numerical information is associated, together with its number of occurrences, with the document or its document identification information and stored in the document DB unit 2400 (S8).

For example, "50 Mbps" extracted from document 11 as numerical information is interpreted as the numerical expression "5 × 10^7 bps". The numerical information "5 × 10^7 bps" therefore occurs once, and this count is associated with the document or its document identification information and stored in the document DB unit 2400 (see FIG. 6).

Once the numerical value extraction unit 3300 has interpreted the numerical information, the converted numerical information is also supplied to the numerical range distribution statistical unit 3400. The numerical range distribution statistical unit 3400 identifies the unit of the numerical information and counts, in the numerical range distribution data, the number of appearing documents for the digit of that numerical information under that unit (S9). The number of appearing documents counted by the numerical range distribution statistical unit 3400 is stored in the numerical range distribution data storage unit 2300 (S10).

For example, the numerical value extraction unit 3300 supplies the numerical information consisting of mantissa "5", exponent "7", and unit "bps" to the numerical range distribution statistical unit 3400. On identifying the unit "bps", the numerical range distribution statistical unit 3400 increments by one the number of appearing documents for the interval corresponding to exponent "7" in the numerical range distribution data for the unit "bps" (for example, from 71 before document 11 was input to 72) (S9).
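The counting in S9 and S10 can be pictured with a nested mapping from unit to exponent to document count. The sketch below is illustrative only; the names `distribution` and `register_numeric_info` are hypothetical, and counting each document at most once per bucket is an assumption drawn from the phrase "number of appearing documents".

```python
from collections import defaultdict

# unit -> exponent ("digit") -> number of documents containing such a value
distribution = defaultdict(lambda: defaultdict(int))


def register_numeric_info(distribution, numeric_infos):
    """Count one document toward each (unit, exponent) bucket it contains.

    numeric_infos is an iterable of (mantissa, exponent, unit) tuples
    extracted from a single document.
    """
    # A set ensures a document is counted once per bucket, giving a
    # document count (df-like) rather than an occurrence count.
    buckets = {(unit, exponent) for _, exponent, unit in numeric_infos}
    for unit, exponent in buckets:
        distribution[unit][exponent] += 1


# Example from the text: the bps / exponent-7 bucket held 71 documents
# before document 11, which contains "50 Mbps" = 5 x 10^7 bps.
distribution["bps"][7] = 71
register_numeric_info(distribution, [(5, 7, "bps")])
print(distribution["bps"][7])
```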

The above processing is performed on each input document. FIGS. 4, 5, and 6 illustrate part of the data obtained when, for example, 2000 documents have been registered in the document storage unit 2000 as registered documents.

Next, the operation of searching the document storage unit 2000 for documents based on an input will be described with reference to the drawings.

For example, assume that "wireless", "provider", and "1 Gbps or more" are input as search conditions from the input unit 1100.

When the search conditions are taken into the input unit 1100 (S21), they are supplied to the word importance calculation unit 1200 and the numerical importance calculation unit 1300.

When the input search conditions are supplied to the word importance calculation unit 1200, it retrieves the df of each keyword from the DF data storage unit 2200 and calculates the importance of each keyword based on that df and the number of documents registered in the document DB unit 2400 (S22). Each input keyword and its calculated importance are supplied to the document evaluation value calculation unit 1400.

For example, the word importance calculation unit 1200 refers to the DF data storage unit 2200 illustrated in FIG. 4 and retrieves df = 36 for the input keyword "wireless". Since 2000 documents are registered in the document DB unit 2400, the word importance calculation unit 1200 calculates the importance of the keyword "wireless" as idf = log(2000/36) = 1.74. Similarly, for the keyword "provider", it retrieves df = 452 from the DF data storage unit 2200 and calculates the importance idf = 0.64.
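The two importance values can be checked with a short calculation. The sketch below assumes a base-10 logarithm, which is inferred from the reported figures rather than stated in the text, and truncates to two decimals because log10(2000/452) is about 0.6459, so the reported 0.64 matches truncation rather than rounding. The helper names `idf` and `truncate` are hypothetical.

```python
import math


def truncate(value, places=2):
    """Drop digits beyond `places` decimals (assumed from the reported 0.64)."""
    factor = 10 ** places
    return math.floor(value * factor) / factor


def idf(total_documents, df):
    """Importance of a search condition as log10(N / df)."""
    return math.log10(total_documents / df)


print(truncate(idf(2000, 36)))    # "wireless"
print(truncate(idf(2000, 452)))   # "provider"
```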

When the input search conditions are supplied to the numerical importance calculation unit 1300, it interprets the numerical condition as a predetermined numerical expression, identifies the unit and the digit of the interpreted value, and retrieves from the numerical range distribution data storage unit 2300 the number of appearing documents for that unit and digit. It then calculates the importance of the numerical value based on the retrieved number of appearing documents and the number of documents registered in the document DB unit 2400 (S23). The numerical value given as the input condition and its calculated importance are supplied to the document evaluation value calculation unit 1400.

For example, the numerical importance calculation unit 1300 processes "1 Gbps or more", the numerical condition among the search conditions obtained from the input unit 1100. Interpreting "1 Gbps or more": since 1 G = 10^9, the digit of the value is 9, and since the range is "or more", the number of appearing documents stored for "9 digits or more" under the unit "bps" is obtained from the numerical range distribution data storage unit 2300 illustrated in FIG. 5. That is, the numerical importance calculation unit 1300 adds "14" for 9 digits and "3" for 10 digits in FIG. 5 to obtain 17 as the number of appearing documents. Treating this as the equivalent of df, it calculates log(2000/17) = 2.07, corresponding to idf, as the importance of this numerical condition.
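A range condition such as "1 Gbps or more" amounts to summing the per-digit document counts over the matching buckets. The sketch below is a minimal illustration: `BPS_DISTRIBUTION` is a hypothetical excerpt containing only the counts quoted in the text, `range_df` is a hypothetical helper, and the base-10 logarithm is inferred from the reported 2.07.

```python
import math

# Hypothetical excerpt of the bps row of FIG. 5: exponent -> document count.
# Only the counts quoted in the text (7: 72, 9: 14, 10: 3) are from the source.
BPS_DISTRIBUTION = {7: 72, 9: 14, 10: 3}


def range_df(distribution, min_exponent=None, max_exponent=None):
    """Sum document counts over buckets whose exponent lies in the range."""
    return sum(
        count
        for exponent, count in distribution.items()
        if (min_exponent is None or exponent >= min_exponent)
        and (max_exponent is None or exponent <= max_exponent)
    )


df = range_df(BPS_DISTRIBUTION, min_exponent=9)   # "1 Gbps or more"
importance = math.log10(2000 / df)
print(df, round(importance, 2))
```

Summing per-bucket document counts can count a document twice when it contains values in more than one matching bucket; the text adds the counts directly, so the sketch does the same.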

On receiving the keywords and numerical value given as input conditions from the word importance calculation unit 1200 and the numerical importance calculation unit 1300, the document evaluation value calculation unit 1400 refers to the document DB unit 2400 and obtains the number of occurrences (tf) of each keyword and numerical value in each document that contains them. It then calculates the evaluation value of each document based on these occurrence counts (tf) and the importance (idf) calculated by the word importance calculation unit 1200 and the numerical importance calculation unit 1300 (S24).

For example, the document evaluation value calculation unit 1400 refers to the document DB unit 2400 illustrated in FIG. 6 and searches for documents that contain the input conditions "wireless", "provider", and "1 Gbps or more". Taking document 11 in FIG. 6, for instance, it obtains tf = 0 for "wireless", tf = 2 for "provider", and tf = 0 for "1 Gbps or more" (that is, 1 × 10^9 bps or more).

The document evaluation value calculation unit 1400 then calculates the evaluation value of document 11 as Σtf·idf = (0 × 1.74) + (2 × 0.64) + (0 × 2.07) = 1.28.

Similarly, for document 12, tf = 0 is obtained for "wireless" and tf = 2 for "provider"; the value "1 to 4 × 10^9 bps" in the document matches "1 Gbps or more", so tf = 1 is obtained for that condition. As a result, the evaluation value of document 12 is calculated as Σtf·idf = (0 × 1.74) + (2 × 0.64) + (1 × 2.07) = 3.35.

Further, for document 13, tf = 1 is obtained for "wireless", tf = 2 for "provider", and tf = 0 for "1 Gbps or more". As a result, the evaluation value of document 13 is calculated as Σtf·idf = (1 × 1.74) + (2 × 0.64) + (0 × 2.07) = 3.02.
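The three evaluation values can be reproduced with a direct sum of tf × idf. This is a sketch using the tf and idf figures quoted in the worked example; the dictionary layout, the condition key `">=1Gbps"`, and the function name `evaluation_value` are assumptions made for illustration.

```python
# Importance (idf) of each search condition, as computed in the text.
IDF = {"wireless": 1.74, "provider": 0.64, ">=1Gbps": 2.07}

# Per-document tf for each condition, taken from the worked example.
TF = {
    "document 11": {"wireless": 0, "provider": 2, ">=1Gbps": 0},
    "document 12": {"wireless": 0, "provider": 2, ">=1Gbps": 1},
    "document 13": {"wireless": 1, "provider": 2, ">=1Gbps": 0},
}


def evaluation_value(tf, idf):
    """Sum of tf x idf over all search conditions."""
    return sum(tf.get(condition, 0) * weight for condition, weight in idf.items())


scores = {doc: round(evaluation_value(tf, IDF), 2) for doc, tf in TF.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Sorting by score places document 12 first, then document 13, then document 11.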

The output unit 1500 determines the documents to output based on the evaluation value of each document calculated by the document evaluation value calculation unit 1400, and outputs the determined documents (S25).

As shown above, among documents 11 to 13 in FIG. 6 no document satisfies both "wireless" and "1 Gbps or more", but document 12 and document 13 each satisfy one of them. In this case document 12, which contains "1 Gbps or more", the condition with the smaller number of appearing documents (smaller df) and therefore the stronger narrowing effect, receives a higher evaluation value than document 13.

The same operation (the processing shown in FIG. 7) is also performed when, for example, the conditions "wireless", "provider", and "10 to 80 Mbps" are given from the input unit 1100.

For the importance of "wireless" and "provider", the word importance calculation unit 1200 obtains the same idf values as above, 1.74 and 0.64 respectively.

For the importance of "10 to 80 Mbps", the numerical importance calculation unit 1300 interprets the condition as the predetermined numerical expression "1 to 8 × 10^7 bps". It therefore obtains from the numerical range distribution data storage unit 2300 of FIG. 5 the number of appearing documents, 72, for the same unit "bps" and digit 7. As a result, the importance of "10 to 80 Mbps" is calculated as log(2000/72) = 1.44.

The document evaluation value calculation unit 1400 then obtains the tf of "wireless" and of "provider" in each document in the same manner as above.

For "10 to 80 Mbps", the evaluation value of each document is calculated by referring to the tf of numerical information whose digit matches this condition, such as "5 × 10^7 bps".

As a result, the following evaluation values are obtained: for document 11, Σtf·idf = (0 × 1.74) + (2 × 0.64) + (1 × 1.44) = 2.72; for document 12, (0 × 1.74) + (2 × 0.64) + (0 × 1.44) = 1.28; and for document 13, (1 × 1.74) + (2 × 0.64) + (0 × 1.44) = 3.02.

Thus, when no document satisfies both the "wireless" and "10 to 80 Mbps" search conditions, document 13, which contains "wireless", the condition with fewer appearing documents and therefore the stronger narrowing effect, receives the highest evaluation value.

(A-3) Effects of the First Embodiment
As described above, according to this embodiment, the data corresponding to df for a numerical search condition is obtained by counting appearances per fixed interval (per digit in this embodiment) and storing the counts in the numerical range distribution data storage unit 2300. This makes the information finite, which allows it to be built at registration time.

As a result, a weight that reflects the value of a numerical search condition can be calculated at search time without counting the documents that match the condition, which improves the processing speed of document search.

Further, as the examples in this embodiment show, weighting based on how strongly a given search condition narrows down the documents can be applied regardless of whether the condition is a word or a numerical value, so that more narrowly focused search results can be obtained.

(B) Second Embodiment
Next, a second embodiment of the document search system, document search method, document search program, and recording medium of the present invention will be described with reference to the drawings.

In this embodiment, an approximate value of df that reflects the width of the numerical range in a numerical search condition is calculated from statistics on the widths of the numerical information in all registered documents.

(B-1) Configuration of the Second Embodiment
The configuration of the second embodiment corresponds to that of the first embodiment shown in FIG. 1. However, the statistics on the range of numerical information obtained by the numerical range distribution statistical unit 3400, the statistics on the range of numerical information stored in the numerical range distribution data storage unit 2300, and the method by which the numerical importance calculation unit 1300 calculates the importance of a numerical search condition differ from those of the first embodiment.

Accordingly, the document search system of the second embodiment is described below using the configuration and reference numerals of the document search system shown in FIG. 1.

FIG. 8 shows the functional configuration of the numerical range distribution statistical unit 3400 of the second embodiment together with an example of the contents stored in the numerical range distribution data storage unit 2300.

As shown in FIG. 8, the numerical range distribution statistical unit 3400 of the second embodiment has at least the following functions: a minimum value/maximum value update unit 3401, a standard width calculation/update unit 3402, an appearance number update unit 3403, an appearance document number update unit 3404, and an appearance document expected value update unit 3405.

The numerical range distribution statistical unit 3400 receives numerical information from the numerical value extraction unit 3300, classifies it by unit type, counts the numerical information in each class, and stores the count for each class in the numerical range distribution data storage unit 2300.

Based on the numerical information from the numerical value extraction unit 3300, the minimum value/maximum value update unit 3401 obtains the maximum and minimum values of the numerical information for each unit type and stores them in the numerical range distribution data storage unit 2300. When numerical information of the same unit type is newly supplied, the minimum value/maximum value update unit 3401 refers to the minimum value 2301 and the maximum value 2302 in the numerical range distribution data storage unit 2300, determines whether the minimum or maximum value for that unit type needs to be updated, and updates it if so. When the numerical information is exponential in nature, for example, the minimum and maximum values may be recorded as logarithms of the values.

When numerical information of a given unit type is supplied, the appearance number update unit 3403 updates the appearance count of that unit type and stores it as the appearance number 2303 in the numerical range distribution data storage unit 2300.

The standard width calculation/update unit 3402 calculates a standard value for the width of the numerical ranges of the received numerical information (hereinafter referred to as the standard width) and stores the calculated standard width as the standard width 2304 in the numerical range distribution data storage unit 2300.

The width of a numerical range can be subjected to a predetermined conversion, according to user operation, so that it matches the user's intuition. For example, for exponential numerical information it is desirable to use the width of the range expressed in logarithms.

Various methods can be used to update the standard width. For example, when new numerical information is supplied, the standard width 2304 already stored in the numerical range distribution data storage unit 2300 is multiplied by the appearance count of the class, the width of the newly obtained numerical information is added, and the result is divided by the appearance count of the unit type after the new numerical information has been added.
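The update described here is an incremental mean. The sketch below illustrates that reading; the function name `update_standard_width` and the choice to return the new count alongside the new width are assumptions, and the text notes that other update methods are possible.

```python
def update_standard_width(standard_width, count, new_width):
    """Fold one new range width into the running mean.

    standard_width: mean width stored so far for this unit type
    count:          number of values already folded into that mean
    new_width:      width of the newly observed numerical range
    Returns the updated (standard_width, count).
    """
    new_count = count + 1
    updated = (standard_width * count + new_width) / new_count
    return updated, new_count


# First value for a unit type: the mean is simply that value's width.
print(update_standard_width(0.0, 0, 1.22))
```

Storing only the mean and the count keeps the per-unit statistics constant in size, so no individual widths need to be retained.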

Further, when numerical information without a range is supplied, the standard width calculation/update unit 3402 can assign a minimum fixed width to the value. For example, when the width of a value is 10% or less of the value, the width can be regarded as 10% of the value.

The appearance document number update unit 3404 updates the number of documents in which numerical information appears and stores it as the number of appearing documents 2305 in the numerical range distribution data storage unit 2300.

The appearance document expected value update unit 3405 retrieves the minimum value 2301, maximum value 2302, standard width 2304, and number of appearing documents 2305 from the numerical range distribution data storage unit 2300 and, based on this information, estimates the expected value DN of the number of documents, among all registered documents, that contain numerical information matching a numerical range of a fixed width. The appearance document expected value update unit 3405 stores the expected value DN as the expected number of appearing documents 2306 in the numerical range distribution data storage unit 2300.

Various methods can be applied by the appearance document expected value update unit 3405 to obtain the expected value DN; for example, it can be obtained as DN = {(standard width) × (number of appearing documents)} / {(maximum value) − (minimum value)}.
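The example formula for DN translates directly into code. This is a minimal sketch: the function name `expected_document_count` is hypothetical, and the handling of the case where the maximum equals the minimum is an assumption, since the text does not cover it.

```python
def expected_document_count(standard_width, document_count, minimum, maximum):
    """DN = (standard width x number of appearing documents) / (max - min)."""
    spread = maximum - minimum
    if spread <= 0:
        # Assumption: with no spread, every document falls in the same range.
        return float(document_count)
    return standard_width * document_count / spread


# Values from the later operation example with only document D21 registered:
# standard width 1.22, one document, minimum 6.18, maximum 7.40 (log scale).
print(expected_document_count(1.22, 1, 6.18, 7.40))
```

With a single registered range, the standard width equals the spread between minimum and maximum, so DN comes out at approximately 1.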

For each unit type, the numerical range distribution data storage unit 2300 stores: the total number of pieces of numerical information of that unit type in all input documents (appearance number) 2303; the number of appearing documents 2305 containing numerical information of that unit type; the standard width 2304 of the numerical ranges of that unit type; the maximum value 2302 and minimum value 2301 of the values of that unit type appearing in the documents; and the expected number of appearing documents 2306. In this embodiment, the standard width 2304, minimum value 2301, and maximum value 2302 are expressed as logarithms of the numerical information.

When the numerical importance calculation unit 1300 receives a numerical search condition from the input unit 1100, it retrieves from the numerical range distribution data storage unit 2300 the expected value DN for the unit type of the value in the search condition, that is, the expected number of documents among all registered documents that contain numerical information matching a numerical range of a fixed width. It then uses DN to estimate the expected number of documents in which the value given as the search condition appears.

The expected number of documents in which the value given as the search condition appears can be estimated as (width of the numerical range of the search condition) × DN.

When the search condition is a value with no range width, the numerical importance calculation unit 1300 can, as in the numerical range distribution statistical unit 3400, give the value a width of a fixed proportion. For example, 10% may be set as the lower limit of the width, and any narrower condition is treated as having a width of 10% of the value.

The expected number of appearing documents for the search condition calculated in this way by the numerical importance calculation unit 1300 is treated as the equivalent of df, as described in the first embodiment; idf, the importance of the search condition, is calculated from it, and processing continues.
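The estimate and the resulting importance can be sketched as below. Several points are assumptions rather than statements from the text: the condition width is measured on a base-10 logarithmic scale (consistent with the logarithmic storage of widths in this embodiment), the function name `numeric_condition_importance` is hypothetical, and a zero-width condition is rejected instead of being widened, because how the 10% minimum maps onto the logarithmic scale is not specified.

```python
import math


def numeric_condition_importance(lower, upper, dn, total_documents):
    """Estimate df as (log-width of the condition) x DN, then take idf.

    lower, upper:    bounds of the numerical search condition (same unit)
    dn:              expected value DN stored for that unit type
    total_documents: number of registered documents
    Returns (estimated_df, importance).
    """
    width = math.log10(upper) - math.log10(lower)
    estimated_df = width * dn
    if estimated_df <= 0:
        raise ValueError("condition has no width; apply a minimum width first")
    return estimated_df, math.log10(total_documents / estimated_df)


# Hypothetical DN of 50 for "bps"; condition "10 to 100 Mbps" in bps.
print(numeric_condition_importance(1e7, 1e8, 50.0, 2000))
```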

(B-2) Operation of the Second Embodiment
The operation of the numerical range distribution statistics unit 3400 according to the second embodiment is described below with reference to its functional configuration diagram in FIG. 8.

As a concrete example, suppose that among the registered documents, document D21 contains the numerical information "1.5 to 25 Mbps", and document D22 contains two pieces of numerical information, "1 to 15 Mbps" and "50 Mbps". No other document is assumed to contain numerical information of the unit type "bps".

When the numerical information extracted by the numerical value extraction unit 3300 is passed to the numerical range distribution statistics unit 3400, the occurrence count for each unit type and the count of documents containing that unit type are updated.

For example, when "1.5 to 25 Mbps" from document D21 is passed to the numerical range distribution statistics unit 3400, the unit confirms that the numerical information has the unit "bps" and adds 1 to the document count for "bps" stored in the numerical range distribution data storage unit 2300. It also adds 1 to the occurrence count of values of the unit type "bps".

Next, the numerical range distribution statistics unit 3400 takes the common (base-10) logarithm of the minimum value 1.5M and the maximum value 25M of the range, obtaining 6.18 and 7.40 respectively. These values update the minimum and maximum stored for the unit "bps" in the numerical range distribution data storage unit 2300. The width between them, 7.40 − 6.18 = 1.22, is also computed and used to update the standard width.

If this is the first information for the unit "bps", it is written as is. As a result, the entry for the unit type "bps" in the numerical range distribution data storage unit 2300 becomes: document count 1, occurrence count 1, minimum 6.18, maximum 7.40, standard width 1.22.

Next, when document D22 is input, suppose that the numerical range distribution statistics unit 3400 first processes the numerical information "1 to 15 Mbps".

The logarithmic width of "1 to 15M" is 1.18. From the standard width 1.22 and the occurrence count 1 held at this point for the unit "bps" in the numerical range distribution data storage unit 2300, the standard width is updated to (1.22 × 1 + 1.18) / (1 + 1) = 1.2.

The remaining updates proceed as before: the document count is incremented by 1 to 2, the occurrence count is incremented by 1 to 2, and the minimum is updated to 6, the logarithm of 1M.

Next, suppose that the numerical information "50 Mbps" in the same document D22 is processed. Because this value has no width, its width is set to 0.04, the logarithmic equivalent of 10%.

From the standard width 1.2 and the occurrence count 2 at this point, the standard width for the unit type "bps" in the numerical range distribution data storage unit 2300 is updated to (1.2 × 2 + 0.04) / (2 + 1) = 0.81.

The maximum is also updated to 7.70, the logarithm of 50M. Because the document is the same document D22 as before, the document count is not updated; the occurrence count for "bps" is incremented by 1 to 3.

After these operations, the numerical range distribution data storage unit 2300 holds the following for the unit type "bps": the total number of occurrences of numerical information is 3, the number of documents in which the numerical information appears is 2, the standard width is 0.81 (logarithmic), the minimum is 6 (logarithmic), and the maximum is 7.70 (logarithmic).
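The registration steps of this worked example can be reproduced with a short sketch. It is an illustration only: the class `UnitStats`, its field names, and the `add` method are hypothetical and do not come from the specification. The sketch assumes base-10 logarithms, a running mean for the standard width, and a logarithmic floor of log10(1.1) for values without a width. It keeps unrounded intermediate values, whereas the text rounds to two decimals at each step; the stored results agree after rounding.

```python
import math
from dataclasses import dataclass, field


@dataclass
class UnitStats:
    """Hypothetical per-unit-type record kept in the distribution data store."""
    occurrences: int = 0                        # number of numerical items seen
    doc_ids: set = field(default_factory=set)   # documents containing the unit
    log_min: float = math.inf                   # smallest log10 value seen
    log_max: float = -math.inf                  # largest log10 value seen
    standard_width: float = 0.0                 # running mean of log widths

    def add(self, doc_id, low, high=None):
        """Register one numerical item (a single value or a low-high range)."""
        high = low if high is None else high
        lo, hi = math.log10(low), math.log10(high)
        # Values with no (or almost no) width get the assumed 10% floor.
        width = max(hi - lo, math.log10(1.10))
        # Running mean over all items seen so far for this unit type.
        self.standard_width = (
            self.standard_width * self.occurrences + width
        ) / (self.occurrences + 1)
        self.occurrences += 1
        self.doc_ids.add(doc_id)  # a set, so the same document counts once
        self.log_min = min(self.log_min, lo)
        self.log_max = max(self.log_max, hi)


stats = UnitStats()
stats.add("D21", 1.5e6, 25e6)  # "1.5 to 25 Mbps"
stats.add("D22", 1e6, 15e6)    # "1 to 15 Mbps"
stats.add("D22", 50e6)         # "50 Mbps"
print(stats.occurrences, len(stats.doc_ids),
      round(stats.standard_width, 2),
      round(stats.log_min, 2), round(stats.log_max, 2))
```

Storing document identifiers in a set is one simple way to keep the document count unchanged when a second value from the same document arrives; a real system would more likely track only the last document seen.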

From these values, the expected value DN for the unit "bps", that is, the expected number of documents among all registered documents that contain numerical information falling within a numerical range of fixed width, is calculated as (0.81 × 2) / (7.70 − 6) = 0.95. This value is also stored in the numerical range distribution data storage unit 2300.
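The calculation of DN can be written as a one-line function. This is a sketch with hypothetical names (`expected_docs_per_unit_width` and its parameters); it takes as input the stored values rounded to two decimals, exactly as the text does, because using unrounded intermediate values shifts the last digit of the result slightly.

```python
def expected_docs_per_unit_width(standard_width, doc_count, log_min, log_max):
    """DN = (standard width x document count) / (log_max - log_min).

    All widths and bounds are on the base-10 logarithmic scale.
    """
    return standard_width * doc_count / (log_max - log_min)


# Stored values for the unit type "bps" from the worked example.
dn = expected_docs_per_unit_width(0.81, 2, 6.0, 7.70)
print(round(dn, 2))
```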

Next, the operation of the numerical importance calculation unit 1300 at document search time is described.

For example, suppose that the numerical search condition taken in by the input unit 1100 is "1 to 10 Mbps".

First, the expected value DN obtained by referring to the numerical range distribution data storage unit 2300 for the unit type "bps" of this search condition is 0.95. From DN and the logarithmic minimum 6 and maximum 7 of the search condition, the estimated number of matching documents is calculated as (7 − 6) × 0.95 = 0.95.

As in the first embodiment, this estimated number of matching documents is then treated as the df-equivalent value to calculate the importance of the search condition, and the document search is performed.

When the numerical search condition is "20 to 25 Mbps", a narrower range than the "1 to 10 Mbps" above, the operation is the same up to the calculation of the number of information appearances per document (the expected value DN).

Then, from the logarithmic minimum 7.30 and maximum 7.40 of the search condition, the estimated number of matching documents is calculated as (7.40 − 7.30) × 0.95 = 0.095.
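Both search-time estimates can be reproduced with the sketch below. The function name `estimate_matching_docs` and its `digits` parameter are hypothetical. To mirror the arithmetic in the text, the logarithms of the bounds are rounded to two decimals before subtracting; without that rounding the second estimate comes out slightly lower than 0.095. The 10% floor for conditions without a width is left out here for brevity.

```python
import math


def estimate_matching_docs(low, high, dn, digits=2):
    """Estimate the number of matching documents as (log width) x DN.

    The log10 bounds are rounded to `digits` decimals, as in the text.
    """
    log_low = round(math.log10(low), digits)
    log_high = round(math.log10(high), digits)
    return (log_high - log_low) * dn


DN_BPS = 0.95  # expected value DN stored for the unit type "bps"

wide = estimate_matching_docs(1e6, 10e6, DN_BPS)     # "1 to 10 Mbps"
narrow = estimate_matching_docs(20e6, 25e6, DN_BPS)  # "20 to 25 Mbps"
print(round(wide, 3), round(narrow, 3))
```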

In the description above, both registration into the numerical range distribution data storage unit 2300 by the numerical range distribution statistics unit 3400 and lookup by the numerical importance calculation unit 1300 are performed per unit type.

However, as in the first embodiment, the values may also be classified by unit type and further by fixed numerical intervals, such as the number of digits, before these operations are performed.

Such a combination makes it possible to estimate the number of matching documents while taking into account both the value of a numerical search condition and its width, and to weight the numerical search condition accordingly.
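One way to picture the combined classification is a store keyed by unit type and numerical interval. The sketch below is an assumption made for illustration: it uses the decade of the value (the integer part of its base-10 logarithm) as the "digit" interval, and the function name `bucket_key` is hypothetical. How the first embodiment actually defines its intervals, and how a range spanning several intervals is handled, is not covered here.

```python
import math
from collections import defaultdict


def bucket_key(unit, value):
    """Classify a value by unit type and by decade (assumed digit interval)."""
    return (unit, math.floor(math.log10(value)))


# Count occurrences per (unit type, decade) for a few sample values.
counts = defaultdict(int)
for unit, value in [("bps", 1.5e6), ("bps", 15e6), ("bps", 50e6)]:
    counts[bucket_key(unit, value)] += 1

print(dict(counts))
```

With such a key, the per-unit statistics of the second embodiment (standard width, counts, minimum, maximum, DN) could be kept separately for each interval instead of once per unit type.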

(B-3) Effects of the Second Embodiment
As described above, according to this embodiment, when the input search conditions "20 to 25 Mbps" and "1 to 10 Mbps" are compared, the narrower condition, that is, the one more likely to narrow down the documents, yields a smaller expected number of matching documents. Because the numerical importance calculation unit 1300 weights the search condition on this basis, a search condition with a narrower numerical range receives a higher importance.
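The effect can be illustrated numerically. The idf formula of the first embodiment is not reproduced in this section, so the sketch below assumes the common textbook form idf = log(N / df); the function name `idf_from_expected_df` and the collection size of 1000 documents are likewise hypothetical. Only the ordering of the two weights matters for the point being made.

```python
import math


def idf_from_expected_df(expected_df, total_docs):
    """Assumed textbook idf: log(N / df), with df replaced by the estimate."""
    return math.log(total_docs / expected_df)


TOTAL_DOCS = 1000  # hypothetical number of registered documents

wide_idf = idf_from_expected_df(0.95, TOTAL_DOCS)     # "1 to 10 Mbps"
narrow_idf = idf_from_expected_df(0.095, TOTAL_DOCS)  # "20 to 25 Mbps"
print(narrow_idf > wide_idf)
```

Under this assumed formula, a tenfold smaller expected document count raises the weight by log(10), regardless of the collection size.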

Further, according to this embodiment, the statistics needed to estimate the number of matching documents are calculated in advance by the numerical range distribution statistics unit 3400 when documents are input, and are stored in the numerical range distribution data storage unit 2300.

In other words, the importance of a numerical search condition can be calculated, taking into account how strongly its numerical range can narrow the results, without searching for and counting the matching documents at search time in order to obtain the statistics needed for the estimate.

(C) Other Embodiments
When documents are searched with numerical values as search conditions, and multiple numerical conditions, or a numerical condition together with a word condition, are specified, the present invention can be used to improve the efficiency of calculating the weight of each search condition.

Specifically, documents can be registered in such a way that, when a document is registered in the database, the statistical tendencies of the numerical information in the document are calculated and stored in the database. This removes the need to calculate the tendencies of numerical values across the document set at search time. At search time, for a numerical search condition, the search process can vary the weight according to the numerical range of the condition.

Thus, the present invention is used to limit the loss of processing performance: whatever can be calculated at registration time about the occurrence tendencies of numerical values in documents is calculated and stored in the database, and is simply referred to at search time.

In the second embodiment described above, the numerical range distribution statistics unit 3400 calculates the standard width, occurrence count, document count, maximum, minimum, and expected value for each unit type. It may instead be combined with the first embodiment so that these quantities are calculated for each numerical interval preset for each unit type.

A functional block diagram showing the functional configuration of the document search system of the first and second embodiments.
A flowchart showing the operation of the document input unit of the first embodiment.
An explanatory diagram of an example document input to the document input unit of the first embodiment.
An explanatory diagram showing example contents of the DF data storage unit of the first embodiment.
An explanatory diagram showing the contents of the numerical range distribution data storage unit of the first embodiment.
An explanatory diagram showing example contents of the document DB unit of the first embodiment.
A flowchart showing the operation of the document search unit of the first embodiment.
A block diagram showing the functional configuration of the numerical range distribution statistics unit of the second embodiment and example contents of the numerical range distribution data storage unit.

Explanation of symbols

1: document search system; 1000: document search unit; 1100: input unit; 1200: word importance calculation unit; 1300: numerical importance calculation unit; 1400: document evaluation value calculation unit; 1500: output unit; 2000: document storage unit; 2200: DF data storage unit; 2300: numerical range distribution data storage unit; 2400: document DB unit; 3000: document input unit; 3100: headword extraction unit; 3200: document count unit; 3300: numerical value extraction unit; 3400: numerical range distribution statistics unit.

Claims

A document search system that takes at least numerical information as a search condition and searches registered documents for a document matching the search condition, comprising:
numerical information classification means for classifying each of one or more pieces of numerical information extracted from an input document by unit type of the numerical information;
feature quantity calculation means for calculating, for each unit type, a feature quantity for estimating documents that contain numerical information within a predetermined numerical range;
feature quantity storage means for storing, for each unit type, the feature quantity calculated by the feature quantity calculation means;
numerical importance calculation means for acquiring from the feature quantity storage means, based on the unit type of input numerical information input as the search condition, the feature quantity of the corresponding unit type, and for calculating, based on at least the acquired feature quantity, a weight of the input numerical information input as the search condition; and
document search means for searching the registered documents for a document based on the weight of the input numerical information calculated by the numerical importance calculation means.

The document search system according to claim 1, wherein
the feature quantity calculation means comprises a distribution statistics unit that classifies each piece of numerical information extracted from the input document into predetermined numerical sections preset according to the unit type, and obtains distribution statistics of the predetermined numerical sections for each unit type, and
the numerical importance calculation means calculates the weight of the input numerical information input as the search condition based on the distribution statistics of the predetermined numerical section corresponding to the unit type of the input numerical information.

The document search system according to claim 1, wherein the numerical importance calculation means calculates, based on the feature quantity stored in the feature quantity storage means, an estimated value of the number of documents matching the input numerical information input as the search condition, and calculates the weight of the input numerical information based on the estimated number of documents.

The document search system according to any of the above claims, wherein
the feature quantity calculation means comprises an expected value calculation unit that calculates, based on the numerical information classified for each unit type, an expected value of the number of registered documents containing numerical information that matches a numerical condition having a range of fixed width, and
the numerical importance calculation means calculates the weight of the input numerical information input as the search condition based on the expected value corresponding to the unit type of the input numerical information.

The document search system according to claim 4, wherein the expected value calculation unit obtains, for the numerical information in all registered documents and for each unit type and/or each predetermined numerical section, a standard width of the numerical values, the number of documents in which the numerical information appears, and the numerical range over which all the numerical information appears, and calculates as the expected value, for each unit type and/or each predetermined numerical section and based on the standard width, the number of documents, and the numerical range, the number of registered documents containing a numerical value within a range of fixed width.

The document search system according to claim 2, wherein the distribution statistics unit processes the logarithm of the value of the numerical information extracted from the input document, and
the numerical importance calculation means processes the logarithm of the value of the input numerical information.

A document search method that takes at least numerical information as a search condition and searches registered documents for a document matching the search condition, comprising:
a numerical information classification step of classifying each of one or more pieces of numerical information extracted from an input document by unit type of the numerical information;
a feature quantity calculation step of calculating, for each unit type, a feature quantity for estimating documents that contain numerical information within a predetermined numerical range;
a feature quantity storage step in which feature quantity storage means stores, for each unit type, the feature quantity calculated in the feature quantity calculation step;
a numerical importance calculation step of acquiring from the feature quantity storage means, based on the unit type of input numerical information input as the search condition, the feature quantity of the corresponding unit type, and calculating, based on at least the acquired feature quantity, a weight of the input numerical information input as the search condition; and
a document search step of searching the registered documents for a document based on the weight of the input numerical information calculated in the numerical importance calculation step.

A document search program that takes at least numerical information as a search condition and searches registered documents for a document matching the search condition, the program causing a computer to function as:
numerical information classification means for classifying each of one or more pieces of numerical information extracted from an input document by unit type of the numerical information;
feature quantity calculation means for calculating, for each unit type, a feature quantity for estimating documents that contain numerical information within a predetermined numerical range;
feature quantity storage means for storing, for each unit type, the feature quantity calculated by the feature quantity calculation means;
numerical importance calculation means for acquiring from the feature quantity storage means, based on the unit type of input numerical information input as the search condition, the feature quantity of the corresponding unit type, and for calculating, based on at least the acquired feature quantity, a weight of the input numerical information input as the search condition; and
document search means for searching the registered documents for a document based on the weight of the input numerical information calculated by the numerical importance calculation means.

A computer-readable recording medium on which is recorded a document search program that takes at least numerical information as a search condition and searches registered documents for a document matching the search condition, wherein the recorded document search program is the document search program according to claim 7.