JP2005301856A

JP2005301856A - Method and program for document retrieval, and document retrieving device executing the same

Info

Publication number: JP2005301856A
Application number: JP2004119859A
Authority: JP
Inventors: Hisao Mase; 久雄間瀬; Makoto Iwayama; 真岩山; Yuichi Ogawa; 祐一小川; Kazutake Kurenishi; 一毅久連石
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2004-04-15
Filing date: 2004-04-15
Publication date: 2005-10-27
Anticipated expiration: 2024-04-15
Also published as: JP4426894B2

Abstract

PROBLEM TO BE SOLVED: To improve the precision of text-input retrieval over a large quantity of text documents, and to reduce the effort needed to pick out a desired document from a retrieval result.

SOLUTION: A union is generated from the top-ranked documents retrieved with the keywords extracted from each of a plurality of partial character strings constituting an input text, and the intersection of this union with the retrieval result based on the keywords extracted from the whole input text is generated. For a document specified by the user, a partial character string whose retrieval rank for that document is higher than the rank obtained with the whole input text is identified, and the partial character string or the keyword information used for that retrieval is displayed to the user.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a document search method for searching documents that contain text data, a document search program, and a document search apparatus that executes them.

One technique for finding a desired document in a large collection of documents containing text data is to enter text (a passage or a set of keywords) as the search condition and retrieve documents whose content is similar to that text. Specifically, the inner product or cosine is computed between a keyword vector made up of one or more weighted keywords extracted from the input text and a keyword vector made up of one or more weighted keywords extracted in advance from each document in the search target collection. This quantifies the content similarity between the input text and each search target document, and documents with high similarity are output as the search result.
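The similarity computation described above can be sketched as follows. This is a minimal illustration, assuming keyword vectors are represented as sparse dictionaries mapping a keyword to its weight; the function name `cosine_similarity` and the sample vectors are hypothetical and not taken from the source.

```python
import math

def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse keyword vectors (keyword -> weight)."""
    # Only keywords present in both vectors contribute to the inner product.
    dot = sum(w * doc_vec[k] for k, w in query_vec.items() if k in doc_vec)
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_q * norm_d)

query = {"document": 2.0, "search": 1.0}
doc_a = {"document": 4.0, "search": 2.0}   # same direction as the query
doc_b = {"printer": 3.0}                   # no keyword in common with the query

sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
```

Using the cosine rather than the raw inner product makes the score independent of vector length, so a long document is not favored merely because its weights are larger.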

In this technique, keywords are extracted from the input text or a search target document using the character type of each word (hiragana, katakana, kanji, alphabet, special symbols, and so on) and the word information defined in a word dictionary (headword, part of speech, and so on) as clues. Words that are clearly unsuitable as keywords are removed as unnecessary words (stop words).

When weighting keywords in this technique, the so-called TF-IDF method is commonly used. It computes the weight given to a keyword from two values: the frequency with which the keyword appears in a given text (TF), and the reciprocal of the number of documents in the search target collection in which the keyword appears (IDF). Widely marketed document search systems generally do not use these values as they are but correct them, for example by applying a logarithmic function (log). Because the appearance frequency (TF) grows with the length of the text, the TF value is also often corrected by the text length or a similar quantity.
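As an illustration of the two corrections mentioned above, one common textbook form of the TF-IDF weight is shown below. It is an assumption chosen for illustration, not a formula taken from this document: TF is normalized by the text length and the inverse document frequency is damped with a logarithm. Here tf(t, d) is the number of occurrences of keyword t in text d, |d| is the length of d in words, N is the number of documents in the collection, and df(t) is the number of documents containing t.

```latex
w(t, d) = \frac{\mathrm{tf}(t, d)}{|d|} \cdot \log\frac{N}{\mathrm{df}(t)}
```

For example, with tf = 3, |d| = 100, N = 1000, and df = 10, and using the natural logarithm, the weight is 0.03 × ln(100) ≈ 0.138.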

Furthermore, as disclosed in Non-Patent Document 1, techniques are known that exploit the structural characteristics of the input text or the search target documents when extracting keywords. For example, when the search target documents are patent specifications, the range from which keywords are extracted is limited to the claims, the abstract, and the like.

Likewise, when weighting the extracted keywords, a technique is known, as disclosed in Non-Patent Document 2, in which the weight is varied according to where in the document the keyword appears. For example, when the search target document is a patent specification, keywords appearing in the title of the invention are given a higher weight, and keywords contained in the noun phrase at the end of a claim (after the expression "characterized by") are given a higher weight.

Furthermore, techniques that expand a search on the basis of its results are also known, as disclosed in, for example, Japanese Patent Laid-Open No. 11-085786.

Japanese Patent Laid-Open No. 11-085786

Non-Patent Document 1: Keio Mizuno (水野恵雄), "On an automatic retrieval system for similar documents", Journal of the Japan Patent Office Technical Conference, No. 223, pp. 9-15, May 15, 2002
Non-Patent Document 2: Hisao Mase et al., "Proposal of patent theme classification method and its evaluation experiment", Transactions of Information Processing Society of Japan, Vol. 39, No. 7, pp. 2207-2216, July 1998

In the prior art described above, the search uses every keyword contained in the text entered by the user. In that case, important keywords closely related to the desired document are often mixed with noise keywords that are unrelated to it. As a result, when the noise keywords influence the similarity calculation more than the important keywords do, the desired document does not appear near the top of the search result even though the important keywords are included, and relevant documents are missed (Problem 1).

In addition, with the prior art it takes a great deal of time to find only the documents the user wants among the document search results (Problem 2).

An object of the present invention is to solve the problem that relevant documents are missed because of noise keywords even though important keywords are included, and to provide a document search method, a document search program, and a document search apparatus executing them that miss fewer documents and achieve higher search accuracy than conventional techniques.

To solve Problem 1, the present invention provides a document search method that extracts keywords from text entered by a user, assigns each extracted keyword a weight corresponding to its importance, and outputs as the search result those documents in the search target collection that correspond to the weighted keywords. In this method, the input text is divided into a plurality of partial character strings, and the search is executed using both the search results obtained with the partial character string keywords extracted from each partial character string and the search result obtained with the keywords of the whole input text.

To solve Problem 2, the present invention runs searches whose only search targets are the documents the user has selected from the document set output as the search result. Using the result of the search based on the keywords of the whole input text and the results of the searches based on the partial character string keywords, it identifies the partial search results in which a document ranks higher than in the whole-text search, identifies the keywords or partial character strings used for those searches, and displays them to the user in a manner that distinguishes them from the rest. In addition, using the result of a search based on a partial character string or keyword set selected by the user and the result of the search based on the keywords of the whole input text, it identifies the documents that rank higher in the search based on the selected partial character string or keyword set than in the whole-text search, and displays the identified documents to the user.

According to the present invention, documents are less often missed because of noise keywords even though important keywords are included, so search omissions are reduced. It also becomes easy to find a desired document in the search result.

Embodiments of the present invention are described in detail below with reference to the drawings. The present invention is not limited to these embodiments.

This embodiment describes a patent search system that searches a large collection of patent publication texts and retrieves documents similar in content to a claim text entered by the user (that is, documents that could invalidate the invention described in the entered claim). The system focuses on the keywords appearing in the text data entered by the user and in the patent publication texts to be searched, and uses the TF-IDF method described above to retrieve documents similar in content to the input text. Although this embodiment targets Japanese text, it is also applicable to text in English and other languages.

FIG. 1 shows an outline of this embodiment. The system takes text as input and outputs a list of documents similar in content to that text. In the conventional method, weighted keywords 102 are extracted from the entire input text 101, search execution 103 is performed, and a search result 104 is output. In this embodiment, the input text 101 is divided into partial character strings 101a, 101b, 101c, and 101d; partial character string keywords 102a, 102b, 102c, and 102d are extracted from the respective partial character strings; and search execution 103 is performed with each set of partial character string keywords to obtain partial search results 104a, 104b, 104c, and 104d. The top N documents (here N = 100) of each partial search result are then merged to generate a partial search result union 105. Finally, the search result for the entire input text (overall search result 104) is filtered with the partial search result union 105: the documents contained in the union are extracted from the overall search result 104 without changing their order, and the top M of them (here M = 200) are output as the final search result 106. Although this embodiment calls this output the "final" result, the documents in the final search result 106 may be used as the search targets of another search technique so that the search continues.
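The flow of FIG. 1 after the individual searches have run can be sketched as below. This is a minimal illustration under stated assumptions: each search result is taken to be a list of document IDs ordered best first, and the function name `filter_by_partial_union` and the sample IDs are hypothetical.

```python
def filter_by_partial_union(overall, partial_results, top_n=100, top_m=200):
    """overall: doc IDs ranked best first for the whole input text.
    partial_results: one best-first doc-ID list per partial character string."""
    # Union of the top-N documents of every partial search result.
    union = set()
    for ranked in partial_results:
        union.update(ranked[:top_n])
    # Keep only documents in the union, preserving the overall ranking order.
    filtered = [doc for doc in overall if doc in union]
    return filtered[:top_m]

overall = ["Doc1", "Doc2", "Doc3", "Doc4", "Doc5", "Doc6"]
partials = [["Doc5", "Doc2", "Doc9"], ["Doc3", "Doc6", "Doc1"]]
final = filter_by_partial_union(overall, partials, top_n=2, top_m=3)
print(final)  # ['Doc2', 'Doc3', 'Doc5']
```

In the small example, Doc1 and Doc4 rank well overall but are not among the top two of any partial search, so they are dropped, which lets Doc5 move up from fifth to third place.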

The number in each document ID used in FIG. 1 equals the document's rank in the overall search result 104. For example, document Doc100 is the document ranked 100th in the overall search result. Looking at partial search result 104a, document Doc164 ranks 164th in the search with the entire input text but 3rd in the search with partial character string 101a. After filtering with the partial search result union 105, it ranks 120th. Thus a document that ranks low in the whole-text search but high in a partial character string search can be presented at a higher position in the displayed search result. Because this draws more of the user's attention to such documents, search omissions can be reduced.

FIG. 2A is a block diagram of the system described in this embodiment, showing how the user's operations, the various data, and the programs that process the data relate to one another.

The user enters text data as the search condition, for example a claim text, through the input/output unit 1 of the system. The entered text is stored and held in input text 2. The text may be typed on a keyboard by the user, or entered by dragging and dropping or copying text data with a mouse or similar device, or by voice, pen, OCR, or the like. The text data may also be passed in automatically from another program.

After entering the claim text data that serves as the search condition, the user instructs the system to execute the search. The keyword extraction program 3 then extracts keywords and assigns a weight to each. The keyword extraction program 3 is described below.

First, partial character string generation 31 divides the input text 2 into units that cohere syntactically, semantically, or in terms of the characters or symbols used. In this embodiment, the text is divided using character strings specified in advance as boundary conditions. These boundary conditions are stored in the parameter setting data 17 (see FIG. 7) as the partial character string extraction collation character strings 172. In FIG. 7, the boundary conditions defined include symbol characters such as the Japanese comma, the Japanese full stop, the comma, and the semicolon; the line feed code (\n); and specific character strings such as "ことを特徴とする" ("characterized in that"). Other methods may of course be used: the text may be forcibly divided every predefined number of characters or words from its beginning; the user may manually specify the partial character string boundaries through the input/output unit 1 as preprocessing for the search; or the text may be morphologically and syntactically analyzed to identify paragraphs (easily identified from line breaks and indentation), sentences (delimited at each full stop), clauses, and phrases, and then divided into those units. Some of these methods require morphological analysis 32 to be performed on the input text 2 first, in which case the order of partial character string generation 31 and morphological analysis 32 is reversed; this poses no problem in practice.
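Boundary-based division of this kind can be sketched with a regular expression. The boundary list below is a hypothetical one in the spirit of FIG. 7, not a copy of it, and the sketch assumes the boundary strings themselves are discarded and empty pieces are dropped, which the source does not specify.

```python
import re

# Hypothetical boundary strings (assumed for illustration).
BOUNDARIES = ["ことを特徴とする", "、", "。", ",", ";", "\n"]

def split_into_partial_strings(text, boundaries=BOUNDARIES):
    # Try longer boundaries first so a multi-character boundary is matched whole.
    ordered = sorted(boundaries, key=len, reverse=True)
    pattern = "|".join(re.escape(b) for b in ordered)
    # Drop empty pieces produced by adjacent or trailing boundaries.
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]

claim = "a keyword extractor, a weighting unit; and\na search unit."
parts = split_into_partial_strings(claim)
# Partial character string IDs start at 1; ID 0 is reserved for the whole text.
numbered = list(enumerate(parts, start=1))
```

For the sample claim this yields four pieces: "a keyword extractor", "a weighting unit", "and", and "a search unit.". A piece such as "and" carries no keyword, so a fuller implementation would probably skip partial strings that yield none.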

Next, morphological analysis 32 refers to the word dictionary 4, in which attribute data for each word is registered, and the grammar dictionary 5, in which connection costs between parts of speech, grammar rules, and the like are defined. It divides the input text 2 and each partial character string into words, obtains the headword, standard form, and part-of-speech data for each word from the word dictionary 4, and stores them in the word table 8 in order of appearance. Morphological analysis is a well-known technique widely used in the document processing field, so its algorithm is not described further here.

Next, unnecessary word removal 33 determines whether each word produced by morphological analysis 32 is an unnecessary word (stop word) registered in the unnecessary word dictionary 6, and excludes the registered words from the keyword candidates. In this embodiment, the unnecessary word dictionary 6 has a data structure that holds one unnecessary word string per record. It contains unnecessary words that depend on the field, content, or document structure of the search target documents (for patent publications, "invention", "claim", "feature", and so on) as well as commonly used high-frequency words that do not depend on the field or content (such as the Japanese function words "こと", "もの", and "いる"). The result of determining whether each word obtained by morphological analysis 32 is an unnecessary word is stored in the word table 8 as an unnecessary word flag (a value of 1 indicates an unnecessary word). Although the unnecessary word dictionary 6 is separate from the word dictionary 4 in this embodiment, the information on whether a word is an unnecessary word may instead be embedded in the word dictionary 4.

Next, keyword identification 34 recognizes as keywords those keyword candidates stored in the word table 8 (words whose unnecessary word flag is 0) that have a predefined part of speech. In this embodiment, the parts of speech that qualify a word as a keyword are listed in the parameter setting data 17 (keyword part-of-speech list 171 in FIG. 7), but this part-of-speech information may instead be embedded in the processing program. The result of determining whether each word obtained by morphological analysis 32 is a keyword is stored in the word table 8 as a keyword flag (a value of 1 indicates a keyword).
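The two flagging steps (unnecessary word removal and keyword identification) can be sketched on a simplified word table. The `WordEntry` class, the dictionary contents, and the part-of-speech labels below are hypothetical stand-ins; a real system would fill the entries from a morphological analyzer.

```python
from dataclasses import dataclass

@dataclass
class WordEntry:
    part_id: int          # partial character string ID
    surface: str          # word as it appears in the text
    base: str             # standard form
    pos: str              # part of speech
    unnecessary: int = 0  # 1 if the word is an unnecessary word
    keyword: int = 0      # 1 if the word is a keyword

# Hypothetical dictionary contents, for illustration only.
UNNECESSARY_WORDS = {"invention", "claim", "feature"}
KEYWORD_POS = {"noun", "verb", "adjective"}

def flag_words(entries):
    for e in entries:
        e.unnecessary = 1 if e.base in UNNECESSARY_WORDS else 0
        # Only words that survive removal and have a qualifying
        # part of speech are recognized as keywords.
        e.keyword = 1 if e.unnecessary == 0 and e.pos in KEYWORD_POS else 0
    return entries

words = flag_words([
    WordEntry(1, "documents", "document", "noun"),
    WordEntry(1, "the", "the", "article"),
    WordEntry(2, "invention", "invention", "noun"),
])
flags = [(w.unnecessary, w.keyword) for w in words]
```

Keeping both flags on every entry, rather than deleting rejected words, preserves the word order of the input text.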

Next, keyword weighting 35 computes, for each extracted keyword, a weight corresponding to the keyword's importance from its appearance frequency in the input text (TF) and the reciprocal of the number of search target documents stored in the document data 13 in which it appears (IDF), and assigns the weight to the keyword. In this embodiment, the weight is calculated by equation (1).

Here, DF is the number of documents in the search target collection stored in the document data 13 in which the keyword appears, and N is the total number of search target documents stored in the document data 13.

Keyword weighting 35 counts the appearance frequency of each keyword stored in the word table 8, obtains the number of documents in which each keyword appears (DF), which is stored in advance in the document index data 14, and calculates the keyword weight with the above formula. The standard form, appearance frequency, number of documents of appearance, and weight of each keyword are stored in the keyword table 9.
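The counting and weighting steps can be sketched as follows. Equation (1) is not reproduced in this text, so the sketch substitutes an assumed weighting, TF × log(N / DF); the actual equation may differ. The function name `build_keyword_table`, the row layout, and the sample numbers are also hypothetical.

```python
import math
from collections import Counter

def build_keyword_table(keywords_by_part, df, n_docs):
    """keywords_by_part: {part_id: [keyword, ...]} in order of appearance.
    df: {keyword: number of documents containing it}; n_docs: collection size."""
    table = []
    for part_id, keywords in sorted(keywords_by_part.items()):
        for kw, tf in Counter(keywords).items():
            # ASSUMED weighting, tf * log(N / DF); equation (1) of the source may differ.
            idf = math.log(n_docs / df[kw])
            table.append({"part_id": part_id, "keyword": kw,
                          "tf": tf, "idf": idf, "weight": tf * idf})
    return table

table = build_keyword_table(
    {0: ["search", "document", "search"],   # part ID 0: the whole input text
     1: ["search", "document"],
     2: ["search"]},
    df={"search": 10, "document": 100},
    n_docs=1000,
)
```

Each keyword gets one row per partial character string in which it occurs, so the same keyword can carry different TF values and weights in different rows.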

Once the keyword extraction program 3 has determined the keywords and their weights, the search execution program 10 searches the document data 13 using the weighted keywords stored in the keyword table 9. In this system, the index generation program 15 extracts weighted keywords in advance from each document stored in the document data 13 and stores them in the document index data 14. The number of documents in which each keyword appears is also calculated and stored in the document index data 14. The cosine of the angle between the keyword vector composed of the weighted keywords stored in the keyword table 9 (the weights are the vector components) and the keyword vector composed of the weighted keywords stored for a document in the document index data 14 is calculated and taken as the similarity of that document (its value lies between -1 and 1). Various methods of calculating similarity have been proposed, but since the present invention places no particular restriction on the method, it is not described in further detail.

The results retrieved by the search execution program 10 are stored in the search result storage table 11 as pairs of document ID and similarity, sorted in descending order of similarity.
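Turning similarity scores into ranked rows can be sketched as below. The row layout (partial character string ID, rank, document ID, similarity), the function name `rank_results`, and the tie-breaking by document ID are assumptions made for illustration.

```python
def rank_results(part_id, similarities, limit):
    """similarities: {doc_id: similarity}. Returns rows of
    (part_id, rank, doc_id, similarity), best first, at most `limit` rows."""
    # Sort by similarity descending; break ties by document ID so the order is stable.
    ordered = sorted(similarities.items(), key=lambda item: (-item[1], item[0]))
    return [(part_id, rank, doc_id, sim)
            for rank, (doc_id, sim) in enumerate(ordered[:limit], start=1)]

rows = rank_results(0, {"JP-A": 0.41, "JP-B": 0.87, "JP-C": 0.41}, limit=2)
print(rows)  # [(0, 1, 'JP-B', 0.87), (0, 2, 'JP-A', 0.41)]
```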

Next, the partial search result union generation program 7 extracts, from the top of each partial character string search result stored in the search result storage table 11, the number of documents defined in advance by the user in the parameter setting data 17, and removes duplicate document IDs to generate the partial search result union 16.

Next, the search result generation and display program 12 generates the intersection of the overall search result for the entire input text, stored in the search result storage table 11, and the partial search result union 16 generated by the partial search result union generation program 7. It generates the final search result while preserving the ranking order of the overall search result, and displays the retrieved document IDs and similarity scores to the user through the input/output unit 1.

The system described in this embodiment with reference to FIG. 2A is implemented on a computer. FIG. 2B is a block diagram showing that system as a computer configuration; components identical to those in FIG. 2A carry the same reference numerals. Reference numeral 200 denotes a system bus. A keyboard 1_1 and a mouse 1_2 are connected to the system bus 200 as input means, and a printing means 1_3 and a display means 1_4 are connected as output means. A central processing unit (CPU) 201, a memory work area 203, and a memory storage area 204 are also connected to the system bus 200. In the example shown, a network 207 is additionally connected to the system bus 200, and a client computer 205 is connected to the other end of the network. With the illustrated system configuration acting as a server, the various processes described with reference to FIG. 2A are executed through the client 205 connected via the network 207.

The various processes described with reference to FIG. 2A are executed by the CPU 201, which reads the necessary programs and data stored in the storage area 204 into the work area 203.

FIG. 3 shows an example of the input text 2. This embodiment is described on the assumption that the input text 2 of FIG. 3 has been entered. Any passage may be entered as the input text 2, and it may contain line breaks.

FIG. 4 shows an example of the structure of the word table 8 corresponding to the input text 2. The word table 8 consists of a partial character string ID 801 identifying the partial character string in which the word appears, a word headword 802, a standard form 803, a part of speech 804, an unnecessary word flag 805 indicating whether the word is an unnecessary word (1 if so), and a keyword flag 806 indicating whether the word is a keyword (1 if so). Entries are stored in the order in which the words appear in the input text 2. The partial character string ID 801 takes an integer value of 0 or more for each unit delimited by the partial character string extraction collation character strings 172 (see FIG. 7), and increases in order of appearance in the input text 2.

FIG. 5 shows an example of the structure of the keyword table 9, which is the output of the keyword extraction program 3. The keyword table 9 consists of a partial character string ID 901 identifying the partial character string in which the keyword appears, a standard form 902 serving as the keyword string, a TF 903 equal to the keyword's appearance frequency, an IDF 904 calculated from the reciprocal of the number of documents in which the keyword appears, and a weight 905 calculated from the TF 903 and the IDF 904. The weight 905 is calculated by equation (1) above. The partial character string ID 901 takes an integer value of 0 or more and increases in order of appearance in the input text 2; entries whose ID is 0 represent keywords extracted from the entire input text 2. In FIG. 5, the weight of a keyword extracted from a partial character string (partial character string ID of 1 or more) is calculated from the TF within that partial character string, but it may instead be calculated from the TF in the entire input text, that is, from the TF 903 value of the entry for the same keyword whose partial character string ID 901 is 0. Alternatively, when calculating the weights of keywords extracted from partial character strings, the TF 903 values may all be fixed at 1, or the weight 905 values may all be fixed at the same value.
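The three ways of choosing the TF value for a keyword extracted from a partial character string can be sketched as a small selector. The function name `tf_for_partial_keyword` and the mode labels are hypothetical; the fourth option mentioned in the text, fixing every weight at the same value, bypasses TF altogether and is not shown.

```python
def tf_for_partial_keyword(keyword, part_tf, whole_tf, mode="partial"):
    """part_tf / whole_tf: {keyword: TF} for one partial character string
    and for the whole input text (partial character string ID 0)."""
    if mode == "partial":   # TF counted within the partial character string
        return part_tf[keyword]
    if mode == "whole":     # reuse the TF of the ID-0 entry for the same keyword
        return whole_tf[keyword]
    if mode == "fixed":     # every TF fixed at 1
        return 1
    raise ValueError(f"unknown mode: {mode}")

part_tf = {"search": 2}
whole_tf = {"search": 5}
choices = [tf_for_partial_keyword("search", part_tf, whole_tf, m)
           for m in ("partial", "whole", "fixed")]
print(choices)  # [2, 5, 1]
```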

FIG. 6 shows an example of the structure of the search result storage table 11. The table consists of a partial character string ID 1101 identifying the partial character string from which the keywords used for the search were extracted, a search result rank 1102, a retrieved document ID 1103, and a similarity 1104. The search result rank 1102 is assigned in descending order of the similarity 1104. The partial character string ID 1101 matches the partial character string ID values in the word table 8 and the keyword table 9. In FIG. 6, the 1000 documents with the highest similarity are stored as the overall search result, and the 100 documents with the highest similarity are stored for each partial character string search result. These counts are set in the parameter setting data 17 (FIG. 7) as the overall search result maximum number of documents 174 and the number of union generation application documents 173, respectively, and can be set freely by the user.

図７は、パラメータ設定データ１７の構成の一例を示す図である。本データは本システムが検索を行うにあたり、必要となる各種パラメータの値を格納しており、利用者またはシステム管理者によって値の設定が可能である。パラメータ設定データ１７に格納されているパラメータには以下のものがある。
（１）キーワード品詞リスト１７１：
キーワードとして抽出される品詞が、普通名詞、サ変名詞、動詞、形容詞、接尾語のみであることを示している。
（２）部分文字列抽出照合文字列１７２：
入力テキストを構文的・意味的・表現された文字あるいは記号的にまとまった部分文字列に分割する際の境界条件を定義する。
（３）和集合生成適用文書件数１７３：
部分文字列検索結果から部分検索結果和集合を生成する際に用いる各部分文字列検索結果上位の文書件数を表す。
（４）全体検索結果最大文書件数１７４：
入力テキスト全体での検索結果として出力する検索文書の最大件数を表す。
（５）最終検索結果出力文書件数１７５：
最終検索結果として出力する文書件数を表す。
（６）キーワード連続数１７６：
キーワードの組を生成するために連続するいくつのキーワードを取り出すかを表す。
（７）キーワード組１７７：
いくつのキーワードの組を生成するかを表す。
FIG. 7 is a diagram illustrating an example of the configuration of the parameter setting data 17. This data stores the values of various parameters that are necessary for the system to search, and the values can be set by the user or the system administrator. The parameters stored in the parameter setting data 17 are as follows.
(1) Keyword part-of-speech list 171:
This indicates that only common nouns, sa-hen nouns (verbal nouns), verbs, adjectives, and suffixes are extracted as keywords.
(2) Partial character string extraction collation character string 172:
Defines the boundary conditions for dividing the input text into partial character strings that form coherent units syntactically, semantically, or in terms of the characters or symbols used.
(3) Number of union generation application documents 173:
This represents the number of documents at the top of each partial character string search result used when generating a partial search result union from the partial character string search result.
(4) Overall search result maximum number of documents 174:
Represents the maximum number of search documents to be output as search results for the entire input text.
(5) Number of final search result output documents 175:
Indicates the number of documents to be output as the final search result.
(6) Number of consecutive keywords 176:
Represents how many consecutive keywords are taken out to generate a set of keywords.
(7) Keyword set 177:
Indicates how many keyword sets are generated.
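As an illustration only, the seven parameters above could be held in a structure like the following. The field names are hypothetical. The values 100 and 1000 follow the example of FIG. 6 and the value 3 follows the examples given for the keyword variations; the boundary strings and the final output count are placeholder assumptions that do not appear in the text.

```python
from dataclasses import dataclass


@dataclass
class ParameterSettings:
    # 171: parts of speech eligible as keywords (as listed in the text)
    keyword_pos: tuple = ("common noun", "sa-hen noun", "verb", "adjective", "suffix")
    # 172: strings that mark a partial character string boundary (placeholder values)
    boundary_strings: tuple = ("、", "。")
    # 173: top documents taken from each partial search result (example of FIG. 6)
    union_top_n: int = 100
    # 174: maximum documents kept for the whole-text search (example of FIG. 6)
    max_overall_results: int = 1000
    # 175: documents output as the final result (placeholder value)
    final_output_count: int = 50
    # 176: number of consecutive keywords per keyword set
    consecutive_keywords: int = 3
    # 177: keyword set parameter
    keyword_set: int = 3


settings = ParameterSettings()
```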

図８Ａおよび図８Ｂは、キーワード抽出プログラム３の処理フローの詳細の前半部および後半部を示した図であり、図８ＡのＡから図８ＢのＡに移る。 FIGS. 8A and 8B show the first and second halves of the detailed processing flow of the keyword extraction program 3; the flow continues from connector A in FIG. 8A to connector A in FIG. 8B.

部分文字列生成３１では、入力テキスト２に格納された入力テキストを読み込む（ステップ３１０１）。次に、ループカウンタＮを０に初期化する（ステップ３１０２）。次に、Ｎの値が入力テキストの長さよりも小さいか否かを判別し（ステップ３１０３）、小さくない場合、ステップ３１０７にスキップする。小さい場合、入力テキストの冒頭からＮバイト目から始まる文字列が、パラメータ設定データ１７の部分文字列抽出照合文字列１７２に定義された文字列のいずれか一つと一致するか否かを照合し（ステップ３１０４）、一致する場合、入力テキストにおける、当該一致した部分文字列抽出照合文字列の直後を部分文字列の境界として認定する（ステップ３１０５）。一致しない場合、ステップ３１０６にスキップする。次に、カウンタＮの値を１増加し（ステップ３１０６）、ステップ３１０３に戻る。次に、入力テキストを、境界と認定された部分で分割し、部分文字列データをワークエリアに一時格納する（ステップ３１０７）。 In the partial character string generation 31, the input text stored in the input text 2 is first read (step 3101). Next, the loop counter N is initialized to 0 (step 3102). It is then determined whether the value of N is smaller than the length of the input text (step 3103); if it is not, the process skips to step 3107. If it is smaller, the character string starting at the Nth byte from the beginning of the input text is checked against the character strings defined in the partial character string extraction collation character string 172 of the parameter setting data 17 (step 3104). If it matches any one of them, the position immediately after the matched collation character string in the input text is recognized as a partial character string boundary (step 3105); if it does not match, the process skips to step 3106. The value of the counter N is then incremented by 1 (step 3106), and the process returns to step 3103. Finally, the input text is divided at the positions recognized as boundaries, and the resulting partial character string data is temporarily stored in the work area (step 3107).
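Steps 3101 to 3107 can be sketched roughly as follows. This is a minimal sketch, not the patented implementation: the function name and the sample delimiters are hypothetical, and positions are counted in characters rather than bytes for simplicity.

```python
def split_into_substrings(text, boundary_strings):
    """Split text right after every occurrence of a boundary string."""
    boundaries = set()
    n = 0                                          # step 3102
    while n < len(text):                           # step 3103
        for b in boundary_strings:                 # step 3104
            if b and text.startswith(b, n):
                boundaries.add(n + len(b))         # step 3105: boundary just after the match
                break
        n += 1                                     # step 3106
    parts, start = [], 0
    for pos in sorted(boundaries):                 # step 3107: cut at the recorded boundaries
        parts.append(text[start:pos])
        start = pos
    if start < len(text):
        parts.append(text[start:])                 # trailing text after the last boundary
    return parts


print(split_into_substrings("a, b; c", [", ", "; "]))
```

With the sample input above, the text is cut after ", " and after "; ", so each partial character string keeps its closing delimiter.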

次に、形態素解析３２では、まず、ワークエリアに一時格納された部分文字列データを読み込む（ステップ３２０１）。次に、単語テーブル８を初期化する（ステップ３２０２）。次に、形態素解析されていない部分文字列があるか否かを判別し（ステップ３２０３）、有る場合、形態素解析を実行し（ステップ３２０４）、部分文字列ＩＤ、単語見出し、単語見出しの標準形、品詞情報を単語テーブル８に入力テキストにおける単語の出現順に格納する（ステップ３２０５）。形態素解析されていない部分文字列が無い場合、不要語除去３３の処理に移る。 Next, in the morphological analysis 32, the partial character string data temporarily stored in the work area is first read (step 3201), and the word table 8 is initialized (step 3202). It is then determined whether any partial character string has not yet been morphologically analyzed (step 3203). If there is one, morphological analysis is executed on it (step 3204), and the partial character string ID, the word's surface form, its standard form, and its part-of-speech information are stored in the word table 8 in the order in which the words appear in the input text (step 3205). If no unanalyzed partial character string remains, the process moves on to the unnecessary word removal 33.

次に、不要語除去３３では、まず、単語テーブル８に不要語除去処理が未処理である単語があるか否かを判別し（ステップ３３０１）、未処理の単語が無い場合、キーワード特定３４の処理に移る。未処理の単語が有る場合、当該単語が不要語辞書６に登録されているかどうかを照合し（ステップ３３０２）、登録されている場合、単語テーブル８における当該単語の不要語フラグ８０５の値を１にする（ステップ３３０３）。登録されていない場合、ステップ３３０１に戻る。 Next, in the unnecessary word removal 33, it is first determined whether the word table 8 contains a word that has not yet undergone unnecessary word removal (step 3301). If there is no unprocessed word, the process moves on to the keyword identification 34. If there is an unprocessed word, it is checked whether that word is registered in the unnecessary word dictionary 6 (step 3302). If it is registered, the value of the unnecessary word flag 805 of that word in the word table 8 is set to 1 (step 3303). If it is not registered, the process returns to step 3301.

次に、キーワード特定３４では、まず、単語テーブル８にキーワード特定処理が未処理である単語があるか否かを判別し（ステップ３４０１）、未処理の単語が無い場合、ループを抜けてステップ３４０４にスキップする。未処理の単語が有る場合、当該単語の品詞がパラメータ設定データ１７のキーワード品詞リスト１７１にあるか否かを判別し（ステップ３４０２）、キーワード品詞リスト１７１に有る場合、単語テーブル８のキーワードフラグ８０６の値を１にし（ステップ３４０３）、ステップ３４０１に戻る。キーワード品詞リスト１７１に無い場合、ステップ３４０１に戻る。次に、キーワードデータをキーワードテーブル９に格納する処理が未処理である単語が単語テーブル８にあるか否かを判別し（ステップ３４０４）、未処理の単語が無い場合、ステップ３４０９にスキップする。 Next, in the keyword identification 34, it is first determined whether the word table 8 contains a word that has not yet undergone keyword identification (step 3401). If there is no unprocessed word, the process exits the loop and skips to step 3404. If there is an unprocessed word, it is determined whether the part of speech of that word is in the keyword part-of-speech list 171 of the parameter setting data 17 (step 3402). If it is, the value of the keyword flag 806 in the word table 8 is set to 1 (step 3403), and the process returns to step 3401; if it is not, the process returns to step 3401 directly. Next, it is determined whether the word table 8 contains a word whose keyword data has not yet been stored in the keyword table 9 (step 3404). If there is no such word, the process skips to step 3409.

未処理の単語が有る場合、単語テーブル８における当該単語のキーワードフラグ８０６の値が１であるか否かを判別し（ステップ３４０５）、１でない場合、ステップ３４０４に戻る。１である場合、さらに、当該単語の標準形８０３がキーワードテーブル９における同一の部分文字列ＩＤを持つ標準形９０２に既に格納されているか否かをチェックし（ステップ３４０６）、既に格納されている場合、当該単語のＴＦ９０３の値を１増加させる（ステップ３４０７）。まだ格納されていない場合、単語テーブル８における当該単語の標準形８０３をキーワードテーブル９の標準形９０２に追加し、ＴＦ９０３の値を１にし、文書インデクスデータ１４から当該単語のＩＤＦ値を取得してＩＤＦ９０４に格納する（ステップ３４０８）。次に、キーワードテーブル９に格納された部分文字列ＩＤが１以上のキーワード毎に当該キーワードのＴＦ９０３の値を合計することにより、入力テキスト全体におけるキーワードのＴＦ９０３の値を算出する（ステップ３４０９）。そして、部分文字列ＩＤの値を０とし、キーワードの標準形９０２、ステップ３４０９で算出されたＴＦ９０３の値、当該キーワードのＩＤＦ値をキーワードテーブル９に格納し、キーワードテーブル９へのデータ格納を終了する（ステップ３４１０）。 If there is an unprocessed word, it is determined whether the value of the keyword flag 806 of that word in the word table 8 is 1 (step 3405); if it is not 1, the process returns to step 3404. If it is 1, it is further checked whether the standard form 803 of the word is already stored as a standard form 902 with the same partial character string ID in the keyword table 9 (step 3406). If it is already stored, the value of the TF 903 of that word is incremented by 1 (step 3407). If it is not yet stored, the standard form 803 of the word in the word table 8 is added to the standard form 902 of the keyword table 9, the value of the TF 903 is set to 1, and the IDF value of the word is obtained from the document index data 14 and stored in the IDF 904 (step 3408). Next, for each keyword stored in the keyword table 9 with a partial character string ID of 1 or more, the TF 903 values of that keyword are summed to obtain the TF 903 value of the keyword in the entire input text (step 3409). Then, with the partial character string ID set to 0, the standard form 902 of the keyword, the TF 903 value calculated in step 3409, and the IDF value of the keyword are stored in the keyword table 9, which completes the storing of data in the keyword table 9 (step 3410).
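The table-building steps 3404 to 3410 can be sketched as follows. This is only a sketch under stated assumptions: the word table is simplified to tuples of (partial character string ID, standard form, keyword flag), the IDF lookup is a plain dictionary standing in for the document index data 14, and a keyword missing from it gets an IDF of 0.0, which the text does not specify.

```python
from collections import Counter


def build_keyword_table(words, idf):
    """words: (substring_id, standard_form, is_keyword) tuples in order of appearance."""
    tf = Counter()
    for sid, form, is_keyword in words:
        if is_keyword:                         # step 3405: only flagged words count
            tf[(sid, form)] += 1               # steps 3406-3408: create or increment TF
    whole_text_tf = Counter()
    for (sid, form), count in tf.items():      # step 3409: sum TF over IDs of 1 or more
        whole_text_tf[form] += count
    table = [                                  # step 3410: rows with ID 0 for the whole text
        {"substring_id": 0, "form": form, "tf": count, "idf": idf.get(form, 0.0)}
        for form, count in whole_text_tf.items()
    ]
    table += [
        {"substring_id": sid, "form": form, "tf": count, "idf": idf.get(form, 0.0)}
        for (sid, form), count in tf.items()
    ]
    return table


words = [(1, "keyword", True), (1, "the", False),
         (2, "keyword", True), (2, "search", True), (2, "keyword", True)]
table = build_keyword_table(words, {"keyword": 1.5, "search": 2.0})
```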

次に、キーワード重み付与３５では、キーワードテーブル９に重み付与処理が未処理であるキーワードがあるか否かをチェックし（ステップ３４１１）、有る場合、キーワードテーブル９のＴＦ９０３に自然対数を施した値にＩＤＦ９０４の値を掛け合わせた値を重み９０５に格納し（ステップ３４１２）、ステップ３４１１に戻る。未処理キーワードが無い場合、処理を終了する。 Next, in the keyword weighting 35, it is checked whether the keyword table 9 contains a keyword that has not yet been weighted (step 3411). If there is one, the natural logarithm of the TF 903 in the keyword table 9 multiplied by the value of the IDF 904 is stored in the weight 905 (step 3412), and the process returns to step 3411. If there is no unprocessed keyword, the process ends.
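Read literally, step 3412 computes each weight as ln(TF) × IDF. The sketch below follows that wording only. Equation (1) is not reproduced in this section, so its exact form (for example, any offset added to the logarithm) is an assumption here, and the row layout is hypothetical.

```python
import math


def assign_weights(table):
    """Store ln(TF) * IDF in each row, following the wording of step 3412."""
    for row in table:                                     # step 3411: visit each keyword once
        row["weight"] = math.log(row["tf"]) * row["idf"]  # step 3412
    return table


rows = assign_weights([{"tf": 1, "idf": 2.0}, {"tf": 3, "idf": 2.0}])
```

Under this literal reading a keyword with a TF of 1 receives a weight of 0, which is one reason to check the result against equation (1) before relying on it.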

図９は、部分検索結果和集合生成プログラム７における部分検索結果和集合１６を生成する処理フローを示す図である。まず変数Ｎの値を１に初期化する（ステップ７０１）。次に、変数Ｎの値が部分文字列の総数以下であるか否かを判別し（ステップ７０２）、総数以下である場合、検索結果格納テーブル１１における部分文字列ＩＤ１１０１の値がＮである検索結果集合の上位から、パラメータ設定データ１７の和集合生成適用文書件数１７３に定義された件数だけ、検索文書ＩＤ１１０３を取り出し、ワーキングエリアに格納する（ステップ７０３）。次に、変数Ｎの値を１増加させ（ステップ７０４）、ステップ７０２に戻る。ステップ７０２で総数以下で無い場合、ループを抜けてステップ７０５に移る。 FIG. 9 is a diagram showing the processing flow by which the partial search result union generation program 7 generates the partial search result union 16. First, the variable N is initialized to 1 (step 701). Next, it is determined whether the value of N is less than or equal to the total number of partial character strings (step 702). If it is, search document IDs 1103 are taken from the top of the search result set whose partial character string ID 1101 in the search result storage table 11 equals N, up to the number defined in the number of union generation application documents 173 of the parameter setting data 17, and stored in the working area (step 703). The value of N is then incremented by 1 (step 704), and the process returns to step 702. If N exceeds the total number in step 702, the process exits the loop and proceeds to step 705.

次に、ワーキングエリアに格納されたすべての検索文書ＩＤをすべてマージしたリストを生成する（ステップ７０５）。次に、本リストをソートし、さらに重複する検索文書ＩＤを統合してユニーク化する（ステップ７０６）。そして、ソートされ、ユニークされた検索文書ＩＤを部分検索結果和集合１６にすべて格納して、処理を終了する。 Next, a list is generated by merging all the search document IDs stored in the working area (step 705). Next, the list is sorted, and duplicate search document IDs are integrated and made unique (step 706). Then, all of the sorted and unique search document IDs are stored in the partial search result union set 16, and the process ends.
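The union generation of steps 701 to 706 can be sketched as follows. The sketch assumes the search result storage table 11 is available as a dictionary from partial character string ID to a list of document IDs already ordered by rank; the function name and sample IDs are hypothetical.

```python
def partial_result_union(results_by_substring, top_n):
    """Merge the top_n document IDs of every partial search result into a sorted, unique list."""
    merged = []
    for sid in sorted(results_by_substring):                     # steps 701, 702, 704
        if sid >= 1:                                             # ID 0 is the whole-text search
            merged.extend(results_by_substring[sid][:top_n])     # step 703
    return sorted(set(merged))                                   # steps 705 and 706


results = {0: ["D9"], 1: ["D3", "D1", "D7"], 2: ["D1", "D2", "D8"]}
print(partial_result_union(results, top_n=2))
```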

図１０は、検索結果を利用者に表示する際の画面例の一例である。 FIG. 10 is an example of a screen example when the search result is displayed to the user.

本画面１００は、図２Ｂの表示手段１４の表示画面の例であり、大きく、入力テキスト表示エリア１２０、キーワード一覧表示エリア１４０、検索結果一覧表示エリア１６０からなる。 This screen 100 is an example of the display screen of the display means 14 in FIG. 2B and broadly consists of an input text display area 120, a keyword list display area 140, and a search result list display area 160.

入力テキスト表示エリア１２０には、利用者が検索条件となる請求項テキストを入出力部1を介して入力すると、一旦、入力されたテキストデータが表示される。この段階で、必要に応じて内容を修正することができる。次いで、検索ボタン１２１を押下することにより、検索結果を得ることができ、この結果は検索結果一覧表示エリア１６０にリストの形で表示される。また、解析ボタン１２２を押下することにより、キーワード抽出プログラム３によって抽出されたキーワード群をキーワード一覧表示エリア１４０に表示することができる。また、分割ボタン１２３を押下することにより、入力テキスト表示エリア１２０のデータをステップ３１０７でワークエリアに格納された部分文字列の表示に切り替える。図１０の表示は、分割ボタン１２３を押下した結果を示している。 In the input text display area 120, when a user inputs a claim text as a search condition via the input / output unit 1, the input text data is once displayed. At this stage, the contents can be modified as necessary. Next, a search result can be obtained by pressing the search button 121, and this result is displayed in the search result list display area 160 in the form of a list. Further, by pressing the analysis button 122, the keyword group extracted by the keyword extraction program 3 can be displayed in the keyword list display area 140. In addition, by pressing the division button 123, the data in the input text display area 120 is switched to the display of the partial character string stored in the work area in step 3107. The display in FIG. 10 shows the result of pressing the division button 123.

また、キーワード一覧表示エリア１４０については、項目を選択してソートボタン１４２を押下することにより、キーワード群を選択された項目でソートすることができる。さらに、編集ボタン１４１を押下することにより、ソートボタン１４２の押下によって得られた結果に対応した表示内容に修正することができ、再検索ボタン１４３を押下することにより、上記修正内容に応じたキーワードで再検索することができる。ここでは、キーワード一覧表示エリア１４０は、検索結果一覧１６０とともに表示しているが、解析ボタン１２２を検索前に押下することによって、キーワード一覧１４０を表示し、その内容を修正してから検索を行って、その結果を検索結果一覧１６０に表示することも可能である。 In the keyword list display area 140, the keyword group can be sorted by the selected item by selecting the item and pressing the sort button 142. Further, by pressing the edit button 141, it is possible to correct the display content corresponding to the result obtained by pressing the sort button 142, and by pressing the re-search button 143, a keyword corresponding to the correction content is displayed. You can search again. Here, the keyword list display area 140 is displayed together with the search result list 160. However, by pressing the analysis button 122 before the search, the keyword list 140 is displayed, and the search is performed after correcting the contents. The result can be displayed in the search result list 160.

検索結果一覧表示エリア１６０では、検索実行プログラム１０で検索された結果が表示される。表示内容のソート（ソートボタン１６１）や表示スクロール（前頁ボタン１６２、次頁ボタン１６３）、個別の文書内容の表示（内容表示ボタン１６４）ができる。 In the search result list display area 160, the results searched by the search execution program 10 are displayed. Display contents can be sorted (sort button 161), display scrolling (previous page button 162, next page button 163), and individual document contents (content display button 164) can be displayed.

検索を終了する時は、終了ボタン１８０を押し下げれば良い。 To end the search, the end button 180 may be pressed down.

本画面において、利用者が検索結果一覧表示エリア１６０に表示されている任意の文書を指定する（本エリア最左端のチェックボックスをチェックする）と、当該文書の記載内容が入力テキストのどの部分文字列と関係が深いのかを特定し、入力テキスト表示エリアにおいて、当該関係の深い部分文字列を他と異なる態様（ここでは左端に記号「◎」が付いている）で表示する。もちろん、当該部分文字列の色やフォント、大きさ、背景色などを変えて表示しても良い。これにより、利用者は、検索結果一覧表示エリア１６０で指定した文書の検索のレベルを決めた部分文字列あるいはキーワードを容易に知ることができる。 On this screen, when the user designates any document displayed in the search result list display area 160 (by checking the check box at the left end of the area), the system identifies which partial character string of the input text the content of that document is closely related to, and displays that partial character string in the input text display area in a manner different from the others (here, with the symbol “◎” at its left end). Of course, the color, font, size, background color, and so on of the partial character string may be changed instead. This lets the user easily see which partial character string or keyword determined the search rank of the document designated in the search result list display area 160.

以下、部分文字列あるいはキーワードと文書との関係の深さを特定する二つの方法について述べる。 The following describes two methods for specifying the depth of the relationship between a partial character string or keyword and a document.

第一の方法は、利用者が選択した文書が、各部分検索結果において何位にランクされているかをそれぞれ求め、その順位が最も高い検索結果に対応する部分文字列を関係が深いものとみなすものである。部分検索結果は検索結果格納テーブル１１に文書ＩＤとともに格納されているので、特定の文書がそれぞれ何位にランクされているかを特定する処理は容易に実現できる。また、部分文字列が決まれば、それに含まれるキーワードはキーワードテーブル９から容易に抽出できる。 The first method determines the rank of the user-selected document in each partial search result and regards the partial character string corresponding to the search result in which it ranks highest as closely related. Since the partial search results are stored in the search result storage table 11 together with the document IDs, identifying the rank of a particular document in each result is easy to implement. Once the partial character string is determined, the keywords it contains can easily be extracted from the keyword table 9.

例えば、図１０の入力テキストを構成する五つの部分文字列による文書Ｄｏｃ１の順位がそれぞれ、１５０位、１位、３位、５位、２位である場合、２番目の部分文字列の検索順位が最も高いので、入力テキスト表示エリア１２０において２番目の部分文字列の左端に記号「◎」を付ける。また、この部分文字列に含まれるキーワードは、「キーワード」と「抽出」なので、キーワード一覧表示エリア１４０においてこれらのキーワード表示行の左端に記号「◎」を付ける。 For example, if the document Doc1 ranks 150th, 1st, 3rd, 5th, and 2nd in the searches by the five partial character strings constituting the input text of FIG. 10, the second partial character string gives the highest rank, so the symbol “◎” is attached to the left end of the second partial character string in the input text display area 120. Since the keywords contained in this partial character string are “keyword” and “extraction”, the symbol “◎” is also attached to the left end of the rows displaying these keywords in the keyword list display area 140.
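The first method reduces to picking the partial character string under which the selected document has the numerically smallest rank. A minimal sketch, using the Doc1 ranks from the example above; the function name and the dictionary layout are hypothetical, and a tie is resolved in favor of the first entry, which the text does not specify.

```python
def most_related_substring(ranks_by_substring):
    """Return the ID of the partial character string under which the document ranks highest."""
    return min(ranks_by_substring, key=ranks_by_substring.get)


# Ranks of Doc1 in the searches by partial character strings 1 to 5.
doc1_ranks = {1: 150, 2: 1, 3: 3, 4: 5, 5: 2}
print(most_related_substring(doc1_ranks))  # prints 2
```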

第二の方法は、利用者が選択した文書について、入力テキストから抽出された重み付き全体キーワードによる検索順位と、部分文字列による部分検索結果の順位とを比較し、部分文字列による部分検索結果の順位の方が高い場合、その部分検索結果に対応する部分文字列を関係が深いものとみなすものである。ここで、単に「順位が高い」だけでなく、順位が「どのくらい高い」かという閾値を設けても良い。すなわち、例えば、全体キーワードによる検索順位が５０位で、ある部分文字列による検索順位が２０位であり、順位が「どのくらい高い」かの閾値が「順位差が２０位以上」である場合、順位が５０位から２０位まで３０位分高くなっているので、当該部分文字列は関係が深いものとする、としても良い。閾値の設定は、順位差でなく、順位の値の絶対的な値（２０位以内）または相対的な値（５０％以内）というような設定の仕方でも良い。 The second method compares, for the user-selected document, its rank in the search by the weighted overall keywords extracted from the input text with its rank in the partial search result for each partial character string; if the rank in the partial search result is higher, the partial character string corresponding to that partial search result is regarded as closely related. Here, a threshold on how much higher the rank must be may be set, rather than merely requiring that it be higher. For example, suppose the rank by the overall keywords is 50th, the rank by a certain partial character string is 20th, and the threshold is a rank difference of 20 or more; the rank has risen by 30 places, from 50th to 20th, so the partial character string may be regarded as closely related. Instead of a rank difference, the threshold may be set as an absolute rank (for example, within the top 20) or a relative value (for example, within 50%).
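The rank-difference form of the second method can be sketched as follows. The function name and the `min_gain` parameter are hypothetical; only the rank-difference threshold is shown, not the absolute or relative variants mentioned above.

```python
def related_substrings(overall_rank, ranks_by_substring, min_gain=1):
    """Return IDs of partial character strings whose rank beats the overall rank by min_gain or more."""
    return [
        sid
        for sid, rank in sorted(ranks_by_substring.items())
        if overall_rank - rank >= min_gain     # a smaller rank number means a higher rank
    ]


# Overall rank 50; ranks under three partial character strings (sample values).
print(related_substrings(50, {1: 20, 2: 45, 3: 80}, min_gain=20))  # prints [1]
```

With `min_gain=1` the function falls back to the plain "ranks higher" test.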

図１１は、検索結果を利用者に出力表示する画面の他の一例を示す図である。図１０と違う点は、利用者が指定する内容の違いである。すなわち、図１０では利用者は検索結果一覧表示エリア１６０の任意の文書を指定したが、図１１では入力テキストに表示された任意の部分文字列を指定する。このとき、キーワード一覧表示エリア１４０には、当該選択された部分文字列に含まれるキーワードの左端に記号「◎」が付加される。また、検索結果一覧表示エリア１６０には、当該部分文字列に関係の深い文書の左端に記号「◎」が付加される。もちろん、当該関係の深い文書情報の色やフォント、大きさ、背景色などを変えて表示しても良い。 FIG. 11 is a diagram showing another example of a screen for outputting and displaying the search result to the user. The difference from FIG. 10 is the difference in content specified by the user. That is, in FIG. 10, the user designates an arbitrary document in the search result list display area 160, but in FIG. 11, designates an arbitrary partial character string displayed in the input text. At this time, in the keyword list display area 140, the symbol “◎” is added to the left end of the keyword included in the selected partial character string. In the search result list display area 160, a symbol “◎” is added to the left end of a document closely related to the partial character string. Of course, the color, font, size, background color, etc. of the document information that is closely related may be changed and displayed.

以下、当該部分文字列と関係が深い文書を特定する二つの方法について述べる。 Hereinafter, two methods for specifying a document closely related to the partial character string will be described.

第一の方法は、当該指定された部分文字列による検索結果をそのまま表示するものである。これは、検索結果格納テーブル１１に格納された当該部分文字列による検索結果をそのまま表示すれば良い。 The first method is to display the search result by the designated partial character string as it is. For this purpose, the search result by the partial character string stored in the search result storage table 11 may be displayed as it is.

第二の方法は、利用者が選択した部分文字列について、入力テキストから抽出された重み付き全体キーワードによる検索順位と、当該部分文字列による部分検索結果の順位とを比較し、当該部分文字列による部分検索結果の順位の方が高い文書を、当該部分文字列と関係が深いものとみなすものである。ここで、単に「順位が高い」だけでなく、順位が「どのくらい高い」かという閾値を設けても良い。すなわち、例えば、当該部分文字列に関して、ある文書の全体キーワードによる検索順位が５０位で、当該部分文字列による検索順位が２０位であり、順位が「どのくらい高い」かの閾値が「順位差が２０位以上」である場合、順位が５０位から２０位まで３０位分高くなっているので、当該部分文字列は関係が深いものとする、としても良い。閾値の設定は、順位差でなく、順位の値の絶対的な値（２０位以内）または相対的な値（５０％以内）というような設定の仕方でも良い。 The second method compares, for the user-selected partial character string, each document's rank in the search by the weighted overall keywords extracted from the input text with its rank in the partial search result for that partial character string, and regards documents that rank higher in the partial search result as closely related to the partial character string. Here, a threshold on how much higher the rank must be may be set, rather than merely requiring that it be higher. For example, suppose that a certain document ranks 50th by the overall keywords and 20th by the partial character string, and the threshold is a rank difference of 20 or more; the rank has risen by 30 places, from 50th to 20th, so the partial character string may be regarded as closely related to that document. Instead of a rank difference, the threshold may be set as an absolute rank (for example, within the top 20) or a relative value (for example, within 50%).

上記該当する文書を他と異なる態様で表示する際に、該当する文書を検索結果一覧表示エリア１６０の冒頭部分に移動させてまとめて表示しても良いし、該当する文書のみを検索結果一覧表示エリア１６０に表示しても良い。 When displaying the corresponding document in a different mode from the others, the corresponding document may be moved to the beginning of the search result list display area 160 and displayed together, or only the corresponding documents are displayed in the search result list. It may be displayed in the area 160.

次に、上記実施形態の変形例について述べる。 Next, a modification of the above embodiment will be described.

（１）分類データの利用 (1) Use of classification data
上記実施形態では、入力テキストを部分文字列に分割し、部分文字列毎に行なった検索結果の上位をマージして部分検索結果集合を生成し、入力テキスト全体の検索結果との積集合を生成することによって、検索精度を向上させるものであった。ここで、部分文字列を単位として検索する場合、入力となるキーワードの中にノイズとなる語が少なくなる反面、キーワード数が減少することもあり、部分文字列がどの分野に関する記述であるかに関する情報が欠落することが多い。その結果、全く違う分野に関する文書が検索結果の上位に現れてしまう。そこで、入力テキスト全体からその記述に対応する分野を特定し、その分野に関連する文書だけを部分文字列検索結果とすることにより、全く違う分野の文書を除外する。 In the embodiment described above, the input text is divided into partial character strings, the top-ranked results of the searches performed for each partial character string are merged to generate a partial search result set, and the intersection (product set) with the search result for the entire input text is generated, thereby improving search accuracy. When searching on a per-substring basis, fewer noise words are included among the input keywords, but the number of keywords also decreases, and information about which field the partial character string describes is often lost. As a result, documents from entirely different fields can appear near the top of the search results. Therefore, the field corresponding to the description is identified from the entire input text, and only documents related to that field are kept as partial character string search results, which excludes documents from entirely different fields.

分野を特定する方法として最も良く使われるのは、文書毎に分類コードまたは統制語（シソーラス語）を付与するものである。本実施形態で扱う特許公開公報には、種々の分類コードが付与されているので、容易に実現可能である。また、入力テキストの内容がどの分野に関する（どの分類に属する）ものかを特定する方法としては、予め分類毎にキーワードを定義しておき、入力テキスト内のキーワードとの類似性を算出して関連の深い分類を特定する方法や、入力テキストを用いて検索実行プログラム１０で検索を行い、その検索結果の上位の文書にどの分類が多く付与されているかを統計分析することによって、関連の深い分類を特定する方法が有効である。 The most common way to identify a field is to assign a classification code or a controlled term (thesaurus term) to each document. Since the published patent applications handled in this embodiment carry various classification codes, this is easy to implement. Effective ways of identifying which field the content of the input text relates to (which classification it belongs to) include defining keywords for each classification in advance and calculating their similarity to the keywords in the input text to identify closely related classifications, and running a search on the input text with the search execution program 10 and statistically analyzing which classifications are most often assigned to the top-ranked documents in the result.
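The statistical approach in the preceding paragraph, followed by the exclusion of documents from other fields, might look like the sketch below. It assumes classification codes are available as a dictionary from document ID to a list of codes. The function names, the `how_many` parameter, and the sample codes are illustrative, and simple frequency counting stands in for whatever statistical analysis an actual system would use.

```python
from collections import Counter


def infer_codes(top_doc_ids, codes_by_doc, how_many=1):
    """Return the classification codes most often assigned to the top-ranked documents."""
    counts = Counter(
        code for doc in top_doc_ids for code in codes_by_doc.get(doc, ())
    )
    return {code for code, _ in counts.most_common(how_many)}


def filter_by_codes(doc_ids, codes_by_doc, allowed):
    """Keep only documents that carry at least one of the allowed codes, preserving order."""
    return [d for d in doc_ids if allowed & set(codes_by_doc.get(d, ()))]


codes = {"D1": ["G06F"], "D2": ["G06F", "H04L"], "D3": ["A01B"]}
field = infer_codes(["D1", "D2", "D3"], codes)
print(filter_by_codes(["D3", "D2"], codes, field))
```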

図１において、検索実行１０３を行う前に、上記方法によって入力テキスト１０１に関連する分類コードを認定しておく。各部分文字列に対して検索を実行して得られた部分検索結果に含まれる文書について、上記認定された分類コードが付与されているかを判別し、付与されていない文書については、部分検索結果から除外する。その他の処理については同一である。分類コードは、上記検索による他に、例えば、入力テキストの表示エリア１２０に入力窓を設けて、利用者が入力するものとしても良い。 In FIG. 1, before the search execution 103 is performed, the classification codes related to the input text 101 are identified by the above method. For each document included in the partial search results obtained by executing a search for each partial character string, it is determined whether an identified classification code is assigned; documents without one are excluded from the partial search results. The other processing is unchanged. Besides being obtained by the search described above, the classification code may be entered by the user, for example through an input field provided in the input text display area 120.
（２）請求項末尾の発明対象記述部分の利用 (2) Use of the invention subject description part at the end of a claim
本変形例は、入力テキストが請求項テキストである場合に有効である。上述した分類データを利用した検索の代わりに、入力された請求項における末尾の名詞句（「〜を特徴とする」と言った固有表現の直後に記載される名詞句、以下「発明対象記述部分」と呼ぶ）の情報を使う。この発明対象記述部分は、分類コード同様、分野を限定する手掛かりとして有効である。そこで、単に入力テキストを部分文字列に分割してそれぞれ検索するのではなく、この発明対象記述部分に出現するキーワードを、各部分文字列から抽出されるキーワードに追加して検索することにより、全く違う分野の検索結果が出力されるのを防止する。図１において、発明対象記述部分に相当するのは、部分文字列４の文字列であり、この部分文字列から抽出されるキーワードはＫＷ２、ＫＷ３、ＫＷ８である。そこで、これらのキーワードを各部分文字列から抽出されたキーワード１０２ａ、１０２ｂ、１０２ｃに追加して、検索を実行する。 This variation is effective when the input text is claim text. Instead of the search using classification data described above, it uses the information in the noun phrase at the end of the input claim (the noun phrase written immediately after a fixed expression such as “characterized by”, hereinafter called the “invention subject description part”). Like a classification code, the invention subject description part is an effective clue for narrowing down the field. Therefore, rather than simply dividing the input text into partial character strings and searching with each, the keywords appearing in the invention subject description part are added to the keywords extracted from each partial character string before searching, which prevents results from entirely different fields from being output. In FIG. 1, the invention subject description part corresponds to partial character string 4, and the keywords extracted from it are KW2, KW3, and KW8. These keywords are therefore added to the keywords 102a, 102b, and 102c extracted from the respective partial character strings, and the search is executed.
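Variation (2) amounts to appending the keywords of the invention subject description part to every partial character string's keyword list before searching. A minimal sketch follows. KW2, KW3, and KW8 come from the example in the text, while the contents of the three per-substring lists are hypothetical, as is the choice to skip keywords that are already present. How the added keywords are weighted is not stated in the text and is left out.

```python
def add_subject_keywords(substring_keywords, subject_keywords):
    """Append the subject-part keywords to each keyword list, without duplicates."""
    return {
        sid: list(dict.fromkeys(list(kws) + list(subject_keywords)))
        for sid, kws in substring_keywords.items()
    }


# Hypothetical keyword lists for partial character strings 1 to 3.
per_substring = {1: ["KW1", "KW2"], 2: ["KW4", "KW5"], 3: ["KW6", "KW7"]}
print(add_subject_keywords(per_substring, ["KW2", "KW3", "KW8"]))
```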

なお、本変形例と、上述した分類を利用する方法を両方適用しても良い。 Note that this variation and the classification-based method described above may be applied together.
（３）入力テキスト内の連続するキーワードの利用 (3) Use of consecutive keywords in the input text
本実施形態では、入力テキストを部分文字列に分割し、各部分文字列からキーワードを抽出したが、本変形例では、まず最初に入力テキストからキーワードを抽出して入力テキスト内での出現順に並べたリストを生成し、当該リストにおいて連続する複数個のキーワードからなる「キーワードの組」を取り出し、これを重み付き部分文字列キーワードの代わりに用いる。本変形例のメリットは、入力テキストを部分文字列に分割する処理が不要となることにある。 In the embodiment, the input text is divided into partial character strings and keywords are extracted from each of them. In this variation, keywords are first extracted from the input text and arranged in a list in their order of appearance in the input text; “keyword sets”, each consisting of several consecutive keywords in the list, are then taken out and used in place of the weighted partial character string keywords. The advantage of this variation is that the input text no longer needs to be divided into partial character strings.

図１２は、入力テキスト内の連続するキーワードの利用による実施形態の概要を図１に対応した形で示した図である。図１と図１２で異なるのは、重み付き部分文字列キーワードの抽出の仕方である。すなわち、まず、入力テキスト１０１を解析して出現順キーワードリスト１０００を生成する。単語テーブル８（図４参照）においてキーワードフラグ８０６の値が１である単語の標準形８０３を順に抽出することによって実現可能である。次に、本リスト１０００を最初からスキャンし、連続する３つのキーワードを取り出してキーワードの組を生成する。この「３つ」という値の設定は、パラメータ設定データ１７のキーワード連続数１７６に定義され、利用者またはシステム管理者によって設定可能である。次に、キーワードへ重みを付与するが、その際、ＴＦの値を１として重みを算出する。ここで、ＴＦの値を１として算出された重みを使う代わりに、入力テキスト全体のＴＦを用いて重みを算出しても良い。すなわち、図１２において、重み付き部分文字列キーワード１０２ａ〜１０２ｄにおける重みの値は、重み付き全体キーワード１０２における重みの値と一致する。 FIG. 12 is a diagram showing, in a form corresponding to FIG. 1, an outline of the embodiment that uses consecutive keywords in the input text. FIG. 12 differs from FIG. 1 in how the weighted partial character string keywords are extracted. First, the input text 101 is analyzed to generate an appearance-order keyword list 1000. This can be done by extracting, in order, the standard forms 803 of the words whose keyword flag 806 is 1 in the word table 8 (see FIG. 4). Next, the list 1000 is scanned from the beginning, and three consecutive keywords at a time are taken out to generate keyword sets. The value “three” is defined in the number of consecutive keywords 176 of the parameter setting data 17 and can be set by the user or the system administrator. Weights are then assigned to the keywords, calculated with the TF value taken as 1. Instead of weights calculated with a TF of 1, weights calculated from the TF in the entire input text may be used. That is, in FIG. 12, the weight values of the weighted partial character string keywords 102a to 102d match the weight values of the weighted overall keywords 102.
（４）入力テキスト内の任意のキーワードの利用 (4) Use of arbitrary keywords in the input text
入力テキスト内の連続するキーワードを利用する代わりに、入力テキスト全体から抽出されたキーワードから任意のＮ個のキーワードを取り出し、このキーワードの組で検索を行うものである。入力テキストの内容を特徴付ける重要キーワードは必ずしも連続して出現するとは限らず、入力テキストに分散して現れることもある。本変形例は、そのような場合において重要なキーワードの組を生成するのに有効である。 Instead of using consecutive keywords in the input text, this variation takes an arbitrary N keywords from the keywords extracted from the entire input text and searches with that keyword set. The important keywords that characterize the content of the input text do not always appear consecutively; they may be scattered throughout the input text. This variation is effective for generating sets of important keywords in such cases.
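The keyword-set generation of variations (3) and (4) can be sketched as follows. Two points are assumptions rather than statements from the text: whether the consecutive windows overlap (exposed here as a hypothetical `step` parameter, defaulting to overlapping windows), and whether "arbitrary N keywords" means enumerating every combination, which is only one possible reading.

```python
from itertools import combinations


def consecutive_sets(keywords, size=3, step=1):
    """Variation (3): windows of `size` consecutive keywords from the appearance-order list."""
    return [
        tuple(keywords[i:i + size])
        for i in range(0, len(keywords) - size + 1, step)
    ]


def arbitrary_sets(keywords, size=3):
    """Variation (4): every combination of `size` distinct keywords, ignoring adjacency."""
    unique = list(dict.fromkeys(keywords))      # drop repeats, keep first-appearance order
    return list(combinations(unique, size))


ordered = ["a", "b", "c", "d"]
print(consecutive_sets(ordered))   # two overlapping windows
print(len(arbitrary_sets(ordered)))
```

The number of combinations grows quickly with the keyword count, so a real system would presumably select only some of them.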

図１３は、入力テキスト全体から抽出されたキーワードから任意のＮ個のキーワードの利用による実施形態の概要を図１に対応した形で示した図である。図１２と図１３で異なるのは、図１２がキーワードの出現順序を考慮しているのに対して、図１３では、それを考慮していない点である。すなわち、入力テキスト１０１から重み付き全体キーワード１０２を抽出し、この中から任意の３種類のキーワードを取り出す。この「３種類」という値は、パラメータ設定データ１７の「キーワードの組１７７」として定義され、利用者またはシステム管理者が自由に設定できる。重み付けは、ＴＦの値を１として算出しても良いし、重み付き全体キーワード１０２の重みの値を継承しても良い。 FIG. 13 is a diagram showing, in a form corresponding to FIG. 1, an outline of the embodiment that uses an arbitrary N keywords taken from the keywords extracted from the entire input text. FIG. 13 differs from FIG. 12 in that FIG. 12 takes the order of appearance of the keywords into account, whereas FIG. 13 does not. That is, the weighted overall keywords 102 are extracted from the input text 101, and any three distinct keywords are taken from among them. The value “three” is defined as the keyword set 177 of the parameter setting data 17 and can be freely set by the user or the system administrator. The weights may be calculated with the TF value taken as 1, or the weight values of the weighted overall keywords 102 may be inherited.
（５）検索回数を一度だけにする (5) Performing the search only once
上述した実施形態では、部分文字列の数が増えるほど、または、キーワードの組の数が増えるほど検索実行プログラム１０において実行される検索回数が多くなる。その結果、検索の性能が低下してしまう。本変形例では、検索回数を１回だけとすることにより、検索性能を向上させる。 In the embodiment described above, the number of searches executed by the search execution program 10 grows as the number of partial character strings or the number of keyword sets increases, and search performance degrades as a result. In this variation, search performance is improved by performing the search only once.

The search is performed only once, using the weighted overall keywords 102 in FIG. 1. This yields the search result documents, each of which carries a similarity score. In this modification, when the similarity is calculated for each document, the share of the similarity score attributable to each keyword in the weighted overall keywords 102 (either a relative or an absolute value) is also extracted. For example, when the similarity is calculated as the inner product of the keyword vector extracted from the input text and the keyword vector extracted from a search target document, the similarity score is the sum of the products of the corresponding vector components, so the product for each keyword represents that keyword's contribution to the similarity score.
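The decomposition of an inner-product score into per-keyword contributions can be sketched as below. The function name `keyword_contributions`, the dictionary representation of the keyword vectors, and the sample weights are assumptions for illustration only.

```python
def keyword_contributions(query_weights, doc_weights):
    """Per-keyword products whose sum is the inner-product similarity."""
    return {
        kw: w * doc_weights[kw]
        for kw, w in query_weights.items()
        if kw in doc_weights
    }


# Hypothetical keyword vectors for the input text and one target document.
query = {"document": 2.0, "retrieval": 1.5, "keyword": 1.0}
doc = {"document": 0.5, "keyword": 2.0, "patent": 3.0}

contrib = keyword_contributions(query, doc)   # absolute contributions
score = sum(contrib.values())                 # inner-product similarity
# Relative contributions; empty when the document shares no keyword.
relative = {kw: c / score for kw, c in contrib.items()} if score else {}

print(contrib)  # {'document': 1.0, 'keyword': 2.0}
print(score)    # 3.0
```

Keeping the absolute contributions alongside the total score is what allows partial scores to be recomputed later without another search.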

Once this "similarity by keyword" is available in addition to the similarity score, there is no need to run a search with the partial character string keywords to obtain search result documents. Instead, the similarity by keyword is extracted for each keyword that was used as a partial character string keyword, and the sum is regarded as the similarity of the document. In this way, the search result documents for each set of partial character string keywords can be obtained without repeating the search.
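A minimal sketch of this idea follows, assuming the single search has already stored the per-keyword contributions for every result document. The names `partial_rankings` and `top_k`, the data layout, and the sample values are hypothetical. The sketch also assumes the partial keywords inherit their weights from the whole input text, since only then does the sum of stored contributions equal the score a separate partial search would produce.

```python
def partial_rankings(contributions, keyword_sets, top_k=2):
    """Rank documents for each keyword set from stored contributions.

    contributions: {doc_id: {keyword: contribution}} from the single search.
    keyword_sets: iterable of keyword collections, one per partial string.
    """
    rankings = []
    for kws in keyword_sets:
        # Partial score = sum of the stored contributions of the set's keywords.
        scores = {
            doc_id: sum(c.get(kw, 0.0) for kw in kws)
            for doc_id, c in contributions.items()
        }
        # Highest score first; ties are broken by document id.
        ranked = sorted(scores, key=lambda d: (-scores[d], d))
        rankings.append([d for d in ranked if scores[d] > 0][:top_k])
    return rankings


# Hypothetical contributions recorded during the single whole-text search.
contributions = {
    "D1": {"document": 1.0, "keyword": 2.0},
    "D2": {"retrieval": 3.0},
    "D3": {"document": 0.5, "retrieval": 1.0},
}
sets = [{"document", "keyword"}, {"retrieval"}]

result = partial_rankings(contributions, sets)
print(result)  # [['D1', 'D3'], ['D2', 'D3']]
```

Only documents returned by the single search can appear in these rankings, which is a limitation of this sketch rather than something the description states.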

A diagram showing an outline of an embodiment of the present invention.
A block diagram showing the system described in the embodiment of the present invention, relating the user's operations, the various data, and the programs that process the data.
A block diagram showing the system described in this embodiment, explained with reference to FIG. 2A, as the configuration of an electronic computer.
A diagram showing an example of the structure of the input text 2 in the embodiment of the present invention.
A diagram showing an example of the structure of the word table 8 corresponding to the input text 2 in the embodiment of the present invention.
A diagram showing an example of the structure of the keyword table 9 in the embodiment of the present invention.
A diagram showing an example of the structure of the search result storage table 11 in the embodiment of the present invention.
A diagram showing an example of the structure of the parameter setting data 17 in the embodiment of the present invention.
A diagram showing the first half of the processing flow of the keyword extraction program 3 in the embodiment of the present invention.
A diagram showing the second half of the processing flow of the keyword extraction program 3 in the embodiment of the present invention.
A diagram showing the processing flow of the partial search union generation program 7 in the embodiment of the present invention.
A diagram showing an example of the search result output screen in the embodiment of the present invention.
A diagram showing another example of the search result output screen in the embodiment of the present invention.
A diagram showing, in a form corresponding to FIG. 1, an outline of an embodiment of the present invention that uses consecutive keywords in the input text.
A diagram showing, in a form corresponding to FIG. 1, an outline of an embodiment of the present invention that uses arbitrary N keywords taken from the keywords extracted from the entire input text.

Explanation of symbols

1 ... input/output unit, 1_1 ... keyboard, 1_2 ... mouse, 1_3 ... printing means, 1_4 ... display means, 2 ... input text, 3 ... keyword extraction program, 4 ... word dictionary, 5 ... grammar dictionary, 6 ... unnecessary word dictionary, 7 ... partial search result union generation program, 8 ... word table, 9 ... keyword table, 10 ... search execution program, 11 ... search result data, 12 ... search result display program, 13 ... document data, 14 ... document index data, 15 ... index generation program, 16 ... partial search result union, 17 ... parameter setting data, 100 ... display screen, 120 ... input text display area, 121 ... search button, 122 ... analysis button, 123 ... reset button, 140 ... keyword list display area, 141 ... edit button, 142 ... sort button, 143 ... re-search button, 160 ... search result list display area, 161 ... sort button, 162 ... previous page button, 163 ... next page button, 164 ... content display button, 180 ... end button, 200 ... system bus, 201 ... CPU, 203 ... memory work area, 204 ... memory storage area, 205 ... client, 207 ... network.

Claims

1. A document search method using a computer comprising a storage device that stores text input by a user, documents to be searched, and weighted keywords extracted in advance from each document, and a central processing unit capable of accessing the storage device, the method comprising:
a step in which the central processing unit analyzes the input text in the storage device and extracts keywords;
a step in which the central processing unit assigns a weight corresponding to importance to each of the extracted keywords;
a step in which the central processing unit compares the weighted keywords with the pre-extracted weighted keywords in the storage device and calculates a similarity for each document in the storage device;
a step in which the central processing unit outputs documents with a high similarity as a search result;
a step in which the central processing unit divides the input text in the storage device into a plurality of partial character strings;
a step in which the central processing unit extracts partial character string keywords from each of the divided partial character strings;
a step in which the central processing unit assigns a weight corresponding to importance to each of the extracted partial character string keywords;
a step in which the central processing unit compares the weighted partial character string keywords extracted from each partial character string with the pre-extracted weighted keywords in the storage device and calculates a partial character string similarity for each document in the storage device;
a step in which the central processing unit extracts documents with a high partial character string similarity as a partial search result;
a step in which the central processing unit generates a union of the documents included in the top ranks of the partial search results extracted for the respective partial character strings;
a step in which the central processing unit extracts, from the search result for the entire input text, the documents included in the union; and
a step in which the central processing unit outputs the extracted documents as a search result.
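The flow recited in this claim can be illustrated informally as follows. This is a sketch under stated assumptions, not the claimed implementation: keyword extraction and weighting are taken as already done, similarity is computed as a plain inner product, and the names `search`, `filtered_search`, `top_k`, and all sample data are hypothetical.

```python
def inner_product(query, doc):
    """Inner product of two keyword-weight dictionaries."""
    return sum(w * doc.get(kw, 0.0) for kw, w in query.items())


def search(query, index):
    """Return document ids ranked by similarity, best first."""
    scores = {d: inner_product(query, vec) for d, vec in index.items()}
    return sorted((d for d in scores if scores[d] > 0),
                  key=lambda d: (-scores[d], d))


def filtered_search(whole_query, partial_queries, index, top_k=2):
    """Keep whole-text results that also rank high for some partial string."""
    whole = search(whole_query, index)
    union = set()
    for q in partial_queries:
        # Union of the top-ranked documents of every partial search.
        union.update(search(q, index)[:top_k])
    # The order of the whole-text search result is preserved.
    return [d for d in whole if d in union]


# Hypothetical pre-extracted weighted keywords of four documents.
index = {
    "D1": {"engine": 1.0, "valve": 1.0},
    "D2": {"engine": 2.0},
    "D3": {"valve": 0.5, "sensor": 2.0},
    "D4": {"engine": 0.5},
}
whole_query = {"engine": 1.0, "valve": 1.0, "sensor": 1.0}
partial_queries = [{"engine": 1.0, "valve": 1.0}, {"sensor": 1.0}]

result = filtered_search(whole_query, partial_queries, index)
print(result)  # ['D3', 'D1', 'D2']
```

In this example D4 appears in the whole-text result but is dropped because it is not among the top two documents of any partial search.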

2. A document search method using a computer comprising a storage device that stores text input by a user, documents to be searched, and weighted keywords extracted in advance from each document, and a central processing unit capable of accessing the storage device, the method comprising:
a step in which the central processing unit analyzes the input text in the storage device and extracts keywords;
a step in which the central processing unit assigns a weight corresponding to importance to each of the extracted keywords;
a step in which the central processing unit compares the weighted keywords with the pre-extracted weighted keywords in the storage device and calculates a similarity for each document in the storage device;
a step in which the central processing unit outputs documents with a high similarity as a search result;
a step in which the central processing unit extracts, from the extracted weighted keywords, sets of weighted keywords each consisting of two or more keywords;
a step in which the central processing unit compares each extracted set of weighted keywords with the pre-extracted weighted keywords in the storage device and calculates a similarity for each document in the storage device;
a step in which the central processing unit extracts documents with a high similarity as a partial search result;
a step in which the central processing unit generates a union of the documents included in the top ranks of the partial search results extracted for the respective sets of weighted keywords;
a step in which the central processing unit extracts, from the search result for the entire input text, the documents included in the union; and
a step in which the central processing unit outputs the extracted documents as a search result.

3. The document search method according to claim 2, wherein the sets of weighted keywords each consisting of two or more keywords are selected in the order of appearance of the keywords in the input text.

4. The document search method according to claim 1, wherein, in the step of dividing the input text into partial character strings, the central processing unit divides the input text in units of paragraphs, sentences, clauses, or phrases, or identifies the boundaries of the partial character strings and divides the input text according to whether a specific descriptive expression appears in the input text.

5. The document search method according to claim 1, wherein, in the step of assigning a weight corresponding to importance to each of the extracted partial character string keywords, the central processing unit assigns, as the weight of each partial character string keyword, the weight that the keyword has for the entire input text.

6. The document search method according to claim 1, wherein the input device of the computer receives from the user any one of: a setting governing how the documents with high similarity are collected from each result when generating the union of the documents with high similarity; a setting of how many consecutive keywords are taken when extracting sets of two or more consecutive keywords from the weighted keywords in order of appearance; and a setting of how many kinds of keywords are taken when extracting sets of weighted keywords.

7. A document search program for causing a central processing unit, which can access a storage device that stores text input by a user, documents to be searched, and weighted keywords extracted in advance from each document, to execute:
a step of analyzing the input text in the storage device and extracting keywords;
a step of assigning a weight corresponding to importance to each of the extracted keywords;
a step of comparing the weighted keywords with the pre-extracted weighted keywords in the storage device and calculating a similarity for each document;
a step of dividing the input text into a plurality of partial character strings;
a step of extracting partial character string keywords from each of the divided partial character strings;
a step of assigning a weight corresponding to importance to each of the extracted partial character string keywords;
a step of comparing the weighted partial character string keywords extracted from each partial character string with the pre-extracted weighted keywords in the storage device and calculating a partial character string similarity for each document in the storage device;
a step of extracting documents with a high partial character string similarity as a partial search result;
a step of generating a union of the documents included in the top ranks of the partial search results extracted for the respective partial character strings;
a step of extracting, from the search result for the entire input text, the documents included in the union; and
a step of outputting the extracted documents as a search result.

8. A document search program for causing a central processing unit, which can access a storage device that stores text input by a user, documents to be searched, and weighted keywords extracted in advance from each document, to execute:
a step of analyzing the input text in the storage device and extracting keywords;
a step of assigning a weight corresponding to importance to each of the extracted keywords;
a step of comparing the weighted keywords with the pre-extracted weighted keywords in the storage device and calculating a similarity for each document;
a step of extracting, from the extracted weighted keywords, sets of weighted keywords each consisting of two or more keywords;
a step of comparing each extracted set of weighted keywords with the pre-extracted weighted keywords in the storage device and calculating a similarity for each document;
a step of extracting documents with a high similarity as a partial search result;
a step of generating a union of the documents included in the top ranks of the partial search results extracted for the respective sets of weighted keywords;
a step of extracting, from the search result for the entire input text, the documents included in the union; and
a step of outputting the extracted documents as a search result.

9. The document search program according to claim 8, wherein the sets of weighted keywords each consisting of two or more keywords are selected in the order of appearance of the keywords in the input text.

10. A document search apparatus comprising a computer in which input/output means, a central processing unit, a memory work area, and a memory storage area are connected to a system bus, the memory storage area holding: input text, which stores the text that has been input; a word dictionary in which attribute data on each word is registered; a grammar dictionary in which connection costs between parts of speech and grammar rules are defined; an unnecessary word dictionary that defines words to be excluded from keyword candidates; a word table that stores the words obtained from the input text; a keyword table that stores those words in the word table that are not words to be excluded from keyword candidates; document data that stores the documents to be searched; document index data for searching the document data; search result data that stores search results; parameter setting data that stores the various search settings given by the user; a search program; and weighted keywords extracted in advance from each document, wherein the search program causes the central processing unit to execute:
a step of analyzing the text input by the user and extracting keywords;
a step of assigning a weight corresponding to importance to each of the extracted keywords;
a step of comparing the weighted keywords with the pre-extracted weighted keywords in the memory and calculating a similarity for each document in the memory;
a step of dividing the input text into a plurality of partial character strings;
a step of extracting partial character string keywords from each of the divided partial character strings;
a step of assigning a weight corresponding to importance to each of the extracted partial character string keywords;
a step of comparing the weighted partial character string keywords extracted from each partial character string with the pre-extracted weighted keywords in the memory and calculating a partial character string similarity for each document in the memory;
a step of extracting documents with a high partial character string similarity as a partial search result;
a step of generating a union of the documents included in the top ranks of the partial search results extracted for the respective partial character strings;
a step of extracting, from the search result for the entire input text, the documents included in the union; and
a step of outputting the extracted documents as a search result.

11. A document search apparatus comprising a computer in which input/output means, a central processing unit, a memory work area, and a memory storage area are connected to a system bus, the memory storage area holding: input text, which stores the text that has been input; a word dictionary in which attribute data on each word is registered; a grammar dictionary in which connection costs between parts of speech and grammar rules are defined; an unnecessary word dictionary that defines words to be excluded from keyword candidates; a word table that stores the words obtained from the input text; a keyword table that stores those words in the word table that are not words to be excluded from keyword candidates; document data that stores the documents to be searched; document index data for searching the document data; search result data that stores search results; parameter setting data that stores the various search settings given by the user; a search program; and weighted keywords extracted in advance from each document, wherein the search program causes the central processing unit to execute:
a step of analyzing the text input by the user and extracting keywords;
a step of assigning a weight corresponding to importance to each of the extracted keywords;
a step of comparing the weighted keywords with the pre-extracted weighted keywords in the memory and calculating a similarity for each document in the memory;
a step of extracting, from the extracted weighted keywords, sets of weighted keywords each consisting of two or more keywords;
a step of comparing each extracted set of weighted keywords with the pre-extracted weighted keywords in the memory and calculating a similarity for each document in the memory;
a step of extracting documents with a high similarity as a partial search result;
a step of generating a union of the documents included in the top ranks of the partial search results extracted for the respective sets of weighted keywords;
a step of extracting, from the search result for the entire input text, the documents included in the union; and
a step of outputting the extracted documents as a search result.

12. The document search apparatus according to claim 11, wherein the sets of weighted keywords each consisting of two or more keywords are selected in the order of appearance of the keywords in the input text.