JP2009104475A

JP2009104475A - Similar document retrieval device, and similar document retrieval method and program

Info

Publication number: JP2009104475A
Application number: JP2007276794A
Authority: JP
Inventors: Minako Izawa; 味奈子井沢; Takashi Kono; 隆志河野; Shuichi Nakawatase; 秀一中渡瀬
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-10-24
Filing date: 2007-10-24
Publication date: 2009-05-14

Abstract

PROBLEM TO BE SOLVED: To provide a similar document retrieval device, a similar document retrieval method, and a program that, when retrieving similar documents from a large number of documents stored in text format, can find not only documents that match the input search string but also documents to which similar tags are attached and documents in which similar keywords are used.

SOLUTION: The similar document retrieval device normalizes the words contained in an input document; holds a setting file storing the words for which tags are to be attached to documents; generates tags in the normalized document according to the setting file; generates an index in which words that serve as clues for retrieving similar documents are associated with document IDs; generates attribute name statistical information, that is, statistics on the attribute names contained in the documents held in the storage means; and retrieves documents on the basis of that statistical information.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a system and method for retrieving similar documents from a large number of documents stored in text format.

Conventionally, various search methods have been proposed for retrieving a target document from a large collection of natural language documents such as newspaper articles and patent publications. These methods can be broadly classified as follows.

(1) Keyword search
In keyword search, each document is indexed in advance against keywords that represent its content. Known methods for determining the keywords include automatic keyword extraction by morphological analysis or the like, manual keyword assignment, and combinations of the two (see, for example, Patent Document 1 and Patent Document 2).

However, keyword search can only search on character strings that have been indexed as keywords. In addition, the accuracy of automatic keyword extraction by morphological analysis depends on the accuracy of the word and grammar dictionaries, so maintaining the dictionaries requires a great deal of manual work.

(2) Fuzzy search
Fuzzy search retrieves not only documents containing a character string that exactly matches the search string but also documents containing a partially matching string.

For example, the user's memory of the search string may be vague, or the search string may appear in various variant forms, and it may be difficult to enumerate all such variants.

A typical conventional method of specifying partial strings uses regular expressions. These make it possible to specify zero or more repetitions of any character, one or more repetitions of any character, the end of a line, the beginning of a line, any character within a specific character code range, and so on (see Patent Document 3).
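The regular expression constructs listed above can be illustrated with a minimal Python sketch. The sample lines and the patterns are hypothetical and are not taken from the patent or from Patent Document 3; they only show one pattern per construct using Python's standard `re` module.

```python
import re

# Hypothetical sample lines, not taken from the patent.
LINES = ["mail magazine signup", "magazine", "mag", "signup for the magazine"]

# One illustrative pattern per construct named in the text.
PATTERNS = {
    "zero_or_more": r"mag.*e",    # "mag", zero or more of any character, then "e"
    "one_or_more": r"mag.+",      # "mag" followed by one or more of any character
    "line_start": r"^magazine",   # "magazine" at the beginning of the line
    "line_end": r"magazine$",     # "magazine" at the end of the line
    "char_range": r"^[a-m]+$",    # lines made only of characters in the range a to m
}

def matching(pattern, lines=LINES):
    """Return the lines in which the pattern matches somewhere."""
    return [line for line in lines if re.search(pattern, line)]
```

Calling `matching(PATTERNS["line_end"])` returns the lines that end in "magazine". Note that none of these patterns lets the user say how loosely a string may match, which is the limitation the text goes on to describe.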

However, with these fuzzy searches it is difficult to specify the degree of ambiguity of the string to be searched, and the search results include many strings that the user does not want or that are unnatural.

(3) Concept-based search
Even when the search string does not match, the user may want to retrieve documents containing words with similar meanings or documents belonging to the same category.

A concept base represents the meaning (concept) of a word as the relationship between the word itself and its attributes, and methods that use a concept base to retrieve similar documents are known.

Concept bases are often built from dictionary data, but this approach cannot be applied to new words or proper nouns that are not in the dictionary, and it cannot extract the "meaning used in texts of a specific field" (context-dependent relationships). For this reason, co-occurrence relationships in corpora and the like are also used (see Patent Document 4).

However, concept-based search requires an enormous number of documents and words to build the concept base, and the processing takes a great deal of time. Moreover, because new proper nouns, technical terms, buzzwords, and the like are constantly being created, the concept base must be updated continually. There is also the problem that, because of notational variation, it is difficult to handle search strings entered in a notation not contained in the concept base, in a notation mixed with hiragana, or in an inappropriate notation.

Patent Documents: JP-A-4-205560; JP-A-4-215181; JP-A-62-221027; JP-A-8-147324

In all of the above conventional techniques, either the search string itself or a word with a similar meaning must be present in the desired document.

However, when a user tries to obtain a "Chaku-Melo" (着メロ, registered trademark) ringtone from the web, for example, the site does not necessarily contain the word "Chaku-Melo". If the site says "Chakushin Melody" (着信メロディ, registered trademark) or "music distribution service for mobile phones" instead of "Chaku-Melo", it will not be retrieved even though it is the site the user is looking for.

Moreover, the names of such services are often new words not listed in dictionaries, so they are hard to retrieve even with fuzzy search or a concept base.

In short, with conventional methods that merely compare the characters in a text with the search string, it is difficult to find the sites the user wants comprehensively.

An object of the present invention is to provide a similar document search device, a similar document search method, and a program that, when retrieving similar documents from a large number of documents stored in text format, can find not only documents that match the input search string but also documents to which similar tags are attached and documents in which similar keywords are used.

The present invention is a similar document search device comprising: document input means for reading a document specified by a user; document analysis means for dividing the input document into words, normalizing the divided words, and calculating feature quantities of the document; normalization means for normalizing the words contained in the input document; a setting file storing the words for which tags are to be attached to documents; document generation means for generating tags, in accordance with the setting file, in a normalized document (a document whose words have been normalized) and storing the document in storage means; index generation means for generating an index in which words that serve as clues for retrieving similar documents are associated with document IDs; statistical information processing means for generating attribute name statistical information, which is statistical information on the attribute names contained in the documents stored in the storage means, and storing it in a storage device; document search means for searching for documents on the basis of the attribute name statistical information; request processing transmission means for transmitting keywords used as search conditions for retrieving a given document to the document search means and the document analysis means; and document display control means for reading the documents included in the search results from the document storage means via communication means and displaying them.

According to the present invention, when similar documents are retrieved from a large number of documents stored in text format, it is possible to find not only documents that match the input search string but also documents to which similar tags are attached and documents in which similar keywords are used.

As a result, the present invention can present more of the sites the user wants by simpler means, improving the user's work efficiency and convenience.

The best mode for carrying out the invention is described in the following embodiments.

FIG. 1 is a diagram showing a similar document search device RS1 according to Embodiment 1 of the present invention.

The similar document search device RS1 is an embodiment that, for a document specified by the user, detects documents with a similar word group, that is, documents whose composition of attribute names is similar.

The similar document search device RS1 includes a server SV1 and a browser B1.

The server SV1 includes normalization means 10, a setting file 20, document generation means 30, index generation means 40, statistical information processing means 50, document search means 60, document analysis means 70, request processing transmission means 80, communication means 90, document input means 100, and document display control means 110.

The normalization means 10 normalizes the words contained in a document. That is, the normalization means 10 normalizes the attribute values contained in the documents stored in the document database DB1. A word is, as the name suggests, a unit obtained by dividing a document; among the many words, only those specified as attribute values in the setting file become attribute values.

"Normalization" means, for example, appending the unabbreviated attribute value "メールマガジン" (mail magazine) to the abbreviated attribute value "メルマガ" (merumaga). The normalized word is appended rather than substituted because, if the original word were replaced outright, the document finally displayed in the browser would show the changed (normalized) word and would therefore differ from the original. Alternatively, instead of appending, the abbreviated attribute value "メルマガ" may be replaced with the unabbreviated attribute value "メールマガジン".
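The two normalization variants described above (appending versus replacing) can be sketched as follows. This is only an illustration under stated assumptions: the synonym table `SYNONYMS`, the function names, and the choice to append the full form in full-width parentheses are all hypothetical, since the patent does not specify how the appended form is written into the document.

```python
# Hypothetical synonym table: abbreviated form -> unabbreviated form.
SYNONYMS = {"メルマガ": "メールマガジン"}

def normalize_by_appending(text, synonyms=SYNONYMS):
    """Keep the original word and append the normalized form after it.

    The parenthesized layout is an assumption made for this sketch.
    """
    for short, full in synonyms.items():
        text = text.replace(short, f"{short}（{full}）")
    return text

def normalize_by_replacing(text, synonyms=SYNONYMS):
    """Alternative variant: replace the abbreviated form outright."""
    for short, full in synonyms.items():
        text = text.replace(short, full)
    return text
```

With appending, the original wording is still present in the stored text, so the original document can be recovered for display; with replacing, it cannot.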

That is, the normalization means 10 detects attribute values in the document that are synonyms, having the same meaning but different expressions, and appends the detected synonyms. Synonyms can be detected by the following method.

FIG. 2 is a diagram showing an example of a method for detecting synonyms using co-occurrence patterns.

The method using co-occurrence patterns shown in FIG. 2 can be adopted. Through this processing, the words of the documents in the document database DB1 are normalized.

FIG. 3 is a diagram showing an example of a document before normalization in Embodiment 1.

FIG. 4 is a diagram showing an example of a normalized and tagged document.

The document generation means 30 receives the normalized document and, if it is specified that the attribute value "メルマガ" is classified under the attribute name "magazine", generates a document to which a "document ID" and "tags" have been added, as shown in FIG. 4, and stores the generated document in the document database DB1.

The attribute values, attribute names, and so on are specified in the setting file 20.

The setting file 20 specifies the words for which tags are to be attached to documents.

For a normalized document, that is, a document whose words have been normalized, the document generation means 30 adds tags to the document in accordance with the setting file 20 and stores it in the document storage means. In other words, the document generation means 30 embeds the document ID and the attribute names as tags. These tags are the items enclosed in < > in FIG. 4.
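A minimal sketch of the tagging step might look like the following. The contents of `SETTINGS`, the `<doc id="...">` wrapper, and the `<name>value</name>` tag layout are assumptions made for illustration; the text only states that the tags are enclosed in < > and carry the document ID and attribute names, and the actual layout is the one shown in FIG. 4.

```python
# Hypothetical setting file content: attribute value -> attribute name.
SETTINGS = {"メルマガ": "magazine", "入力": "input"}

def tag_document(doc_id, text, settings=SETTINGS):
    """Wrap each attribute value in a tag named after its attribute name,
    then wrap the whole text in a tag carrying the document ID."""
    for value, name in settings.items():
        text = text.replace(value, f"<{name}>{value}</{name}>")
    return f'<doc id="{doc_id}">{text}</doc>'
```

For example, `tag_document("001", "メルマガを入力")` wraps "メルマガ" in a `magazine` tag and "入力" in an `input` tag inside a document element with ID 001.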

The index generation means 40 generates an index in which words that serve as clues for retrieving similar documents are associated with document IDs.

Next, the pre-search processing in the similar document search device RS1 in Embodiment 1 will be described.

FIG. 5 is a flowchart specifically showing the pre-search processing in the similar document search device RS1 in Embodiment 1.

First, words are normalized (S1), and a document containing the normalized words is generated and stored (S2). An index is then generated (S3), statistical information is generated (S4), and the processing ends.

The index generation means 40 generates the index. This "index" is a table in which each word contained in the documents stored in the document database DB1 is associated with the document IDs of the documents containing that word.

FIG. 6 is a diagram showing an example of the index.

As shown in FIG. 6, in the index, for example, the word "説明" (explanation) is associated with the document ID "001" and the IDs of the other documents containing that word.
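The index described above is what is commonly called an inverted index. The sketch below assumes that each document has already been divided into words; the sample documents and the function name `build_index` are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Build a word -> sorted list of document IDs table.

    docs maps a document ID to the list of words in that document
    (word segmentation is assumed to have been done beforehand).
    """
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical documents, already split into words.
docs = {"001": ["説明", "メルマガ"], "002": ["説明"], "003": ["入力"]}
index = build_index(docs)
```

Here `index["説明"]` lists documents 001 and 002, mirroring the kind of row the text describes for FIG. 6.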

The statistical information processing means 50 generates statistical information for each attribute name in the setting file 20 on the basis of the documents D, and stores it as attribute name statistical information.

That is, the statistical information processing means 50 generates and stores statistical information on the attribute names contained in the documents stored in the document storage means.

FIG. 7 is a diagram showing an example of attribute name statistical information.

The attribute name statistical information shown in FIG. 7 lists "magazine" as an attribute name and records, for each document that contains "magazine", the document ID together with the number of times the attribute name "magazine" occurs in that document.
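As a rough sketch of how such per-attribute-name counts could be accumulated, the following assumes documents already split into words and a hypothetical setting file mapping attribute values to attribute names. The names `SETTINGS` and `build_attribute_name_stats` and the sample data are illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical setting file content: attribute value -> attribute name.
SETTINGS = {"メルマガ": "magazine", "メールマガジン": "magazine", "入力": "input"}

def build_attribute_name_stats(docs, settings=SETTINGS):
    """Return attribute name -> {document ID: number of occurrences}."""
    stats = defaultdict(Counter)
    for doc_id, words in docs.items():
        for word in words:
            name = settings.get(word)
            if name is not None:
                stats[name][doc_id] += 1
    return {name: dict(counts) for name, counts in stats.items()}

# Hypothetical documents, already split into words.
docs = {
    "001": ["メルマガ", "説明", "メールマガジン", "入力"],
    "002": ["メルマガ"],
    "003": ["説明"],
}
stats = build_attribute_name_stats(docs)
```

In this sample, `stats["magazine"]` records two occurrences for document 001 and one for document 002, which corresponds to the document ID and count pairs described for FIG. 7.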

Next, among the processes performed by the components in the block diagram of FIG. 1, the process by which the similar document search device RS1 searches for documents similar to a document input from the browser B1 will be described.

FIG. 8 is a flowchart specifically showing the process by which the similar document search device RS1 searches for documents similar to a document input from the browser B1.

The document search means 60 searches for documents on the basis of the attribute name statistical information. That is, within the group of documents satisfying the specified conditions, the document search means 60 compares the attribute name statistical information associated with each document and detects documents whose values fall within a certain range. Here, the "group of documents satisfying the specified conditions" is the group of documents containing the words specified via the word input unit.

When, for example, a document α is input by the user, the document input means 100 transmits this document α to the communication means 90.

The communication means 90 passes the transmitted document α to the request processing transmission means 80, which in turn passes it to the document analysis means 70.

The document analysis means 70 analyzes the content of the input document. That is, it normalizes the words contained in document α and searches the document for the attribute values stored in the setting file 20. In other words, the document analysis means 70 performs two processes: dividing the document into words, and calculating a feature quantity (tag) for each attribute name. If attribute values are present, it calculates a feature quantity for each attribute name and returns the result to the request processing transmission means 80.

The request processing transmission means 80 transmits the keywords used as search conditions for retrieving a given document to the document search means and the document analysis means.

The document input means 100 reads the document specified by the user.

The document display control means 110 reads the documents included in the search results from the document storage means via the communication means 90 and displays them.

The feature quantity can be calculated by simply counting the number of occurrences, by setting thresholds and classifying, or by calculating the ratio of the number of occurrences to all words appearing in the document.
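The three ways of calculating a feature quantity named above can be sketched as follows for a single attribute name. The function names, the threshold values `(1, 3)`, and the sample words are assumptions for illustration; the patent does not give concrete thresholds or class labels.

```python
def count_feature(words, attribute_values):
    """Method 1: simple count of occurrences of the attribute's values."""
    return sum(1 for word in words if word in attribute_values)

def threshold_feature(count, thresholds=(1, 3)):
    """Method 2: classify a count into a class number using thresholds.

    With the hypothetical thresholds (1, 3): 0 -> class 0,
    1 or 2 -> class 1, 3 or more -> class 2.
    """
    return sum(1 for t in thresholds if count >= t)

def ratio_feature(words, attribute_values):
    """Method 3: occurrences as a fraction of all words in the document."""
    if not words:
        return 0.0
    return count_feature(words, attribute_values) / len(words)

# Hypothetical document, already split into words.
words = ["メルマガ", "説明", "メルマガ", "入力"]
magazine_values = {"メルマガ", "メールマガジン"}
```

The ratio variant makes long and short documents comparable, whereas the raw count favors long documents; which trade-off is preferable depends on the document collection.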

That is, first a document is input (S11); the document is analyzed, meaning that it is divided into words and a feature quantity is calculated for each attribute name (S12); documents are searched (S13); and the search results are transmitted (S14). The search results are displayed in the browser (S15), and the processing ends.

FIG. 9 is a diagram showing an example of feature vectors in Embodiment 1.

The request processing transmission means 80 passes the processing request to the document search means 60. On the basis of the attribute name statistical information, the document search means 60 retrieves the IDs of documents whose combination of attribute names and feature quantities is identical or similar to that of the input document, and returns the retrieved document IDs to the request processing transmission means 80.

A document whose combination of attribute names and feature quantities is identical to that of document α is, as shown in FIG. 9, a document in which the feature quantity of the attribute name "input" is "2", that of the attribute name "keys" is "5", that of the attribute name "magazine" is "3", and so on; in other words, a document in which every pairing of attribute name and feature quantity is exactly the same as in document α.

However, when searching on the basis of the attribute name statistical information, it is desirable to retrieve not only documents whose combination of attribute names and feature quantities is exactly the same as that of document α but also documents similar to document α. A document similar to document α is one whose feature quantity for each attribute name lies within a predetermined range of the feature quantity for that attribute name in document α.

The criterion for a document similar to document α is, for example, that the document's feature quantity for each attribute name is within ± a predetermined percentage of the corresponding feature quantity in document α, or that the attribute names for which the feature quantity is within ± the predetermined percentage make up at least half of all attribute names in document α.
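The two example criteria can be sketched as follows. The feature vector for document α uses the values quoted in the text for FIG. 9; the candidate documents, the tolerance of 50%, the function names, and the treatment of a missing attribute name as a feature quantity of 0 are all assumptions made for this sketch.

```python
def within_tolerance(target, candidate, pct):
    """True if candidate lies within +/- pct percent of target."""
    return abs(candidate - target) <= target * pct / 100

def similar_all(alpha, doc, pct):
    """Criterion 1: every attribute name of alpha is within tolerance."""
    return all(within_tolerance(v, doc.get(name, 0), pct)
               for name, v in alpha.items())

def similar_half(alpha, doc, pct):
    """Criterion 2: at least half of alpha's attribute names are within tolerance."""
    hits = sum(within_tolerance(v, doc.get(name, 0), pct)
               for name, v in alpha.items())
    return 2 * hits >= len(alpha)

alpha = {"input": 2, "keys": 5, "magazine": 3}   # values quoted for FIG. 9
doc_b = {"input": 3, "keys": 4, "magazine": 9}   # hypothetical candidate
doc_c = {"input": 9, "keys": 0, "magazine": 3}   # hypothetical candidate
```

With a 50% tolerance, `doc_b` fails the first criterion (its "magazine" value is too far from 3) but passes the second, because two of the three attribute names are within range; `doc_c` fails both.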

The request processing transmission means 80 reads all documents corresponding to the document IDs from the document database DB1 and transmits them to the browser via the communication means 90.

FIG. 10 is a diagram showing an example of how the document display control unit 110 displays documents.

As shown in FIG. 10, the document display control unit 110 of the browser extracts a portion of each received document, creates a document list, and displays it. The portion extracted is a location specified in advance (column, line, format, keyword). When the user selects one document, the displayed list is cleared and the whole of the selected document is displayed.

FIG. 11 is a diagram showing an example in which document ID 001 in FIG. 10 has been selected and the whole of the document with document ID 001 is displayed.

Embodiment 2 is an embodiment that detects documents matching a word group specified by the user, that is, documents with a similar composition of attribute names or attribute values.

FIG. 12 is a block diagram showing a similar document search device RS2 according to Embodiment 2 of the present invention.

The similar document search device RS2 is the similar document search device RS1 with the document analysis means 70 removed and attribute name statistical information A2 provided in place of attribute name statistical information A1. In addition, a browser B2 is provided in place of the browser B1.

The similar document search device RS2 includes normalization means 10, a setting file 20, document generation means 30, index generation means 40, statistical information processing means 50, document search means 60, request processing transmission means 80, communication means 90, document input means 100, and document display control means 110.

The browser B2 includes attribute name/attribute value input means 120 for inputting attribute names and attribute values, and document display control means 130.

FIG. 13 is a flowchart specifically showing, among the processes performed by the components in the block diagram of FIG. 12, the pre-search processing in the similar document search device RS2.

The processing of S21 to S23 is the same as that of S1 to S3 in the similar document search device RS1.

The statistical information processing means 50 generates statistical information for each attribute name stored in the setting file 20 on the basis of the documents stored in the document database DB1, and stores it as attribute name statistical information.

First, words are normalized (S21); a document containing the normalized words is generated and stored (S22); and an index is generated (S23). Statistical information is then generated (S24), and the processing ends.

FIG. 14 is a diagram showing how one attribute name is assigned to each piece of statistical information.

Each piece of statistical information contains one or more entries, each associating a document ID with the attribute values contained in the document with that ID.

That is, FIG. 14 focuses on a single attribute name, "magazine", and shows the IDs of the documents containing this attribute name together with the attribute values contained in the document corresponding to each ID.
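One possible in-memory shape for the Embodiment 2 statistical information is a mapping from attribute name to document ID to the attribute values found in that document. The sketch below is an assumption about layout, not a reproduction of FIG. 14; `SETTINGS`, `build_attribute_value_stats`, and the sample documents are hypothetical.

```python
from collections import defaultdict

# Hypothetical setting file content: attribute value -> attribute name.
SETTINGS = {"メルマガ": "magazine", "メールマガジン": "magazine", "入力": "input"}

def build_attribute_value_stats(docs, settings=SETTINGS):
    """Return attribute name -> {document ID: [distinct attribute values]}.

    Values are listed once each, in order of first appearance.
    """
    stats = defaultdict(dict)
    for doc_id, words in docs.items():
        for word in words:
            name = settings.get(word)
            if name is None:
                continue
            values = stats[name].setdefault(doc_id, [])
            if word not in values:
                values.append(word)
    return dict(stats)

# Hypothetical documents, already split into words.
docs = {
    "001": ["メルマガ", "説明", "メールマガジン"],
    "002": ["メルマガ", "メルマガ"],
    "003": ["説明"],
}
a2 = build_attribute_value_stats(docs)
```

Document 003 contains no attribute value, so it does not appear under any attribute name.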

FIG. 15 is a flowchart specifically showing, among the processes performed by the components of the similar document search device RS2, the process by which the similar document search device RS2 searches for similar documents on the basis of an attribute name or attribute value input from the browser.

The user inputs an attribute name (S31); documents are searched on the basis of the input attribute name (S32); the processing result, namely the retrieved documents, is transmitted to the browser B2 (S33); and the retrieved documents are displayed in the browser B2 (S34).

That is, when the user inputs "magazine" as the attribute name in the attribute name/attribute value input means 120, the input attribute name "magazine" is transmitted to the communication means 90. A plurality of attribute names or attribute values may be selected. Attribute names or attribute values may also be preselected by default.

FIG. 16 is an explanatory diagram of a method for selecting an attribute name or attribute value in the browser B2.

The communication means 90 passes the received attribute name to the request processing transmission means 80, which passes it to the document search means 60. On the basis of the attribute name statistical information, the document search means 60 retrieves the document IDs of the documents that contain all of the input attribute names and returns the retrieved document IDs to the request processing transmission means 80.
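Retrieving the documents that contain all of the input attribute names amounts to intersecting the document ID sets recorded for each name. The following is a minimal sketch; the function name and the sample statistics are hypothetical, and returning an empty result for an empty query is a choice made here, not something the patent specifies.

```python
def search_by_attribute_names(stats, names):
    """Return sorted IDs of documents that contain every given attribute name.

    stats maps attribute name -> {document ID: any per-document data}.
    """
    if not names:
        return []
    id_sets = [set(stats.get(name, {})) for name in names]
    return sorted(set.intersection(*id_sets))

# Hypothetical attribute name statistical information.
stats = {
    "magazine": {"001": 2, "002": 1},
    "input": {"001": 1, "003": 4},
}
```

Searching for "magazine" alone yields documents 001 and 002; adding "input" narrows the result to document 001.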

Note that the processing of S33 and S34 is the same as that of S14 and S15 in the similar document search device RS1.

本発明の実施例３は、ユーザが指定したワードと語群（属性名または属性値を含み、かつ属性名の構成が類似している文書）を検出する実施例である。 The third embodiment of the present invention is an embodiment that detects a word and a word group (documents including attribute names or attribute values and having similar attribute name configurations) specified by the user.

FIG. 17 is a block diagram showing a similar document search device RS3 that is Embodiment 3 of the present invention.

The similar document search device RS3 is the same as the similar document search device RS2, except that a browser B3 is provided instead of the browser B2. The browser B3 includes the attribute name attribute value input means 120, a word input means 140, and a document control means 150.

The word input means 140 is a means for inputting a specific keyword.

The pre-search processing in the similar document search device RS3 is the same as the processing in Embodiment 2 shown in FIG. 13.

FIG. 18 is a flowchart specifically showing, among the processes performed in each part of the similar document search device RS3 shown in FIG. 17, the process in which the similar document search device RS3 searches for similar documents based on the word and attribute name input from the browser B3.

For example, when the user inputs the word "着うた" (Chaku-Uta), the word input means 140 transmits this word to the communication means 90. A plurality of words may be input.

When the user designates "magazine" as the attribute name, the attribute name attribute value input means 120 transmits the designated attribute name to the communication means 90. A plurality of attribute names may be designated.

That is, first, a word specified by the user (an input word or user-specified word, as distinct from the words obtained by dividing a document) is input via the word input means 140, together with an attribute name (S41). Documents are searched based on the word specified by the user (S42), and documents are searched based on the attribute name (S43). These search results are integrated (S44), the integrated result is transmitted (S45), and it is displayed on the browser B3 (S46).

FIG. 19 is an explanatory diagram of the case where an attribute name is selected and a word is input on the browser B3.

The communication means 90 gives the transmitted keyword and the attribute name α to the request processing transmission means 80, which in turn gives the keyword and the attribute name α to the document search means 60.

The document search means 60 searches the index for the document IDs associated with the keyword "着うた" (Chaku-Uta), and also searches, based on the attribute name statistical information, for the document IDs of documents that include all of the input attribute names. It then returns the document IDs common to both results to the request processing transmission means 80.
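The combination of the two searches (S42 to S44) can be sketched as follows. This is only an illustration under assumed data layouts: the index is taken to map a word to a set of document IDs, and the attribute name statistical information is taken to map an attribute name to a set of document IDs. The function name `search_by_word_and_attributes` and the sample IDs are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set


def search_by_word_and_attributes(
    index: Dict[str, Set[str]],
    stats: Dict[str, Set[str]],
    word: str,
    attribute_names: Iterable[str],
) -> Set[str]:
    """Return IDs of documents that match the keyword and hold all attribute names."""
    # S42: document IDs associated with the keyword in the index.
    word_hits = set(index.get(word, set()))

    # S43: document IDs of documents that include every input attribute name.
    attr_hits: Optional[Set[str]] = None
    for name in attribute_names:
        docs = stats.get(name, set())
        attr_hits = set(docs) if attr_hits is None else attr_hits & docs
    if attr_hits is None:
        attr_hits = set()

    # S44: keep only the document IDs common to both results.
    return word_hits & attr_hits


# Illustrative data only.
index = {"着うた": {"001", "002", "004"}}
stats = {"magazine": {"001", "002", "005"}}

print(sorted(search_by_word_and_attributes(index, stats, "着うた", ["magazine"])))  # ['001', '002']
```

How several keywords would be combined is not stated in this passage, so the sketch accepts a single keyword.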

The processing of S45 and S46 is the same as the processing of S14 and S15.

The above embodiments can also be understood as a method invention. That is, the above embodiments constitute a similar document search method characterized by comprising: a document input step in which the user designates a document and the designated document is read; a document division step of dividing the input document into words; a normalization step of normalizing the words included in the input document; a word storage step of storing, in a setting file, the words to which tags are to be attached in a document; a document generation step of generating tags, according to the setting file, in a normalized document, which is a document whose words have been normalized, and storing the result in a storage device; an index generation step of generating an index in which words that serve as clues for searching for similar documents are associated with document IDs; a statistical information processing step of generating attribute name statistical information, which is statistical information about the attribute names included in the documents stored in the storage device, and storing it in the storage device; a document search step of searching for documents based on the attribute name statistical information; a request processing transmission step of transmitting a keyword, used as a search condition for searching for a predetermined document, to the document search means and the document analysis means; and a document display control step of reading the documents included in the search result from the storage device via the communication means and displaying them.
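As a rough illustration of the index generation step and the statistical information processing step listed above, the following sketch builds both structures from already normalized and tagged documents. The in-memory document layout (a list of words plus a mapping of attribute names to attribute values), the function names, and the sample values are assumptions made for this example; the specification does not fix a storage format.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical in-memory form of normalized, tagged documents:
# document ID -> {"words": [...], "attributes": {attribute name: attribute value}}
docs = {
    "001": {"words": ["着うた", "配信"], "attributes": {"magazine": "A", "artist": "X"}},
    "002": {"words": ["配信"], "attributes": {"magazine": "B"}},
}


def build_index(documents: Dict[str, dict]) -> Dict[str, Set[str]]:
    """Index generation: associate each word with the IDs of documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, doc in documents.items():
        for word in doc["words"]:
            index[word].add(doc_id)
    return dict(index)


def build_attribute_stats(documents: Dict[str, dict]) -> Dict[str, Dict[str, str]]:
    """Statistics generation: for each attribute name, record which documents
    contain it and the attribute value each of those documents holds."""
    stats: Dict[str, Dict[str, str]] = defaultdict(dict)
    for doc_id, doc in documents.items():
        for name, value in doc["attributes"].items():
            stats[name][doc_id] = value
    return dict(stats)


index = build_index(docs)
stats = build_attribute_stats(docs)
print(sorted(index["配信"]))   # ['001', '002']
print(stats["magazine"])       # {'001': 'A', '002': 'B'}
```

Keeping the attribute value next to each document ID mirrors the per-attribute view described for the attribute name "magazine", where document IDs are listed together with the attribute values they contain.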

Note that the above keyword is the word input via the word input means 140, that is, the word specified by the user.

A program that causes a computer to function as each of the means constituting the above similar document search device may also be envisaged.

FIG. 1 is a diagram showing a similar document search device RS1 that is Embodiment 1 of the present invention.
FIG. 2 is a diagram showing an example of a method for detecting synonyms using co-occurrence patterns.
FIG. 3 is a diagram showing an example of a document before normalization in Embodiment 1.
FIG. 4 is a diagram showing an example of normalization and of a tagged document.
FIG. 5 is a flowchart specifically showing the pre-search processing in the similar document search device RS1 in Embodiment 1.
FIG. 6 is a diagram showing an example of an index.
FIG. 7 is a diagram showing an example of attribute name statistical information.
FIG. 8 is a flowchart specifically showing, among the processes performed in each part of the block diagram shown in FIG. 1, the process in which the similar document search device RS1 searches for documents similar to a document input from the browser.
FIG. 9 is a diagram showing an example of a feature vector in Embodiment 1.
FIG. 10 is a diagram showing a display example of documents displayed by the document display control unit 110.
FIG. 11 is a diagram showing an example in which document ID 001 in FIG. 10 is selected and the entire document with document ID 001 is displayed.
FIG. 12 is a block diagram showing a similar document search device RS2 that is Embodiment 2 of the present invention.
FIG. 13 is a flowchart specifically showing, among the processes performed in each part of the block diagram shown in FIG. 12, the pre-search processing in the similar document search device RS2.
FIG. 14 is a diagram that focuses on one attribute name, "magazine", and shows the IDs of the documents that include this attribute name and the attribute values included in the documents corresponding to those document IDs.
FIG. 15 is a flowchart specifically showing, among the processes performed in each part of the similar document search device RS2, the process in which the similar document search device RS2 searches for similar documents based on the attribute name or attribute value input from the browser.
FIG. 16 is an explanatory diagram of a method for selecting an attribute name or an attribute value on the browser B2.
FIG. 17 is a block diagram showing a similar document search device RS3 that is Embodiment 3 of the present invention.
FIG. 18 is a flowchart specifically showing, among the processes performed in each part of the similar document search device RS3 shown in FIG. 17, the process in which the similar document search device RS3 searches for similar documents based on the word and attribute name input from the browser B3.
FIG. 19 is an explanatory diagram of the case where an attribute name is selected and a word is input on the browser B3.

Explanation of symbols

RS1 ... similar document search device,
10 ... normalization means,
20 ... setting file,
30 ... document generation means,
40 ... index generation means,
50 ... statistical information processing means,
60 ... document search means,
70 ... document analysis means,
80 ... request processing transmission means,
90 ... communication means,
B1 ... browser,
100 ... document input means,
110 ... document display control means,
RS2 ... similar document search device,
B2 ... browser,
120 ... attribute name attribute value input means,
130 ... document display control means,
RS3 ... similar document search device,
B3 ... browser,
140 ... word input means,
150 ... document display control means.

Claims

Document input means with which a user designates a document and which reads the designated document;
Document analysis means for dividing the input document into words, normalizing the divided words, and calculating document features;
Normalization means for normalizing words contained in the input document;
A setting file that stores the words to which tags are to be attached in a document;
Document generation means for generating tags, according to the setting file, in a normalized document, which is a document whose words have been normalized, and storing the result in the storage means;
Index generation means for generating an index in which a word that is a clue to search for a similar document and a document ID correspond;
Statistical information processing means for generating attribute name statistical information, which is statistical information about attribute names included in a document stored in the storage means, and storing the attribute name statistical information in the storage device;
A document search means for searching for a document based on the attribute name statistical information;
Request processing transmission means for transmitting a keyword used as a search condition for searching for a predetermined document to the document search means and the document analysis means;
A document display control means for reading out and displaying a document included in the search result from the document storage means via a communication means;
A similar document search device characterized by comprising:

Normalization means for normalizing words contained in a document;
A setting file that stores the words to which tags are to be attached in a document;
Document generation means for generating tags, according to the setting file, in a normalized document, which is a document whose words have been normalized, and storing the result in the document storage means;
Index generation means for generating an index in which a word that is a clue to search for a similar document and a document ID correspond;
Statistical information processing means for generating attribute name statistical information, which is statistical information about attribute names included in a document stored in the document storage means, and storing the attribute name statistical information in a storage device;
A document search means for searching for a document based on the attribute name statistical information;
Request processing transmission means for transmitting search conditions to the document search means and the document analysis means;
An attribute name attribute value input means for the user to select the displayed attribute name and attribute value;
A document display control means for reading out and displaying a document included in the search result from the document storage means via a communication means;
A similar document search device characterized by comprising:

The similar document search device according to claim 2,
further comprising word input means for inputting a specific keyword.

The similar document search device according to any one of claims 1 to 3,
wherein the document search means is means for comparing the attribute name statistical information attached to each document in a document group satisfying a designated condition, and detecting documents whose values fall within a certain range.

A document input step in which the user designates a document and the designated document is read;
A document analysis step of dividing the input document into words, normalizing the divided words, and calculating document features;
A normalization step of normalizing words contained in the input document;
A word storage step of storing, in a setting file, the words to which tags are to be attached in a document;
A document generation step of generating tags, according to the setting file, in a normalized document, which is a document whose words have been normalized, and storing the result in a storage device;
An index generation step of generating an index in which a word serving as a clue to search for a similar document and a document ID correspond;
A statistical information processing step of generating attribute name statistical information, which is statistical information about attribute names included in the document stored in the storage device, and storing the attribute name statistical information in the storage device;
A document search step for searching for a document based on the attribute name statistical information;
A request processing transmission step of transmitting a keyword used as a search condition for searching for a predetermined document to the document search means and the document analysis means;
A document display control step of reading and displaying a document included in the search result from the storage device via a communication means;
A similar document search method characterized by comprising:

A program that causes a computer to function as each means constituting the similar document search device according to claim 1.