JP4888677B2

JP4888677B2 - Document search system

Info

Publication number: JP4888677B2
Application number: JP2001206840A
Authority: JP
Inventors: 雅実鈴木; 直己井ノ上; 一則松本; 和夫橋本
Original assignee: National Institute of Information and Communications Technology; KDDI R&D Laboratories Inc
Current assignee: National Institute of Information and Communications Technology; KDDI R&D Laboratories Inc
Priority date: 2001-07-06
Filing date: 2001-07-06
Publication date: 2012-02-29
Anticipated expiration: 2021-07-06
Also published as: JP2003022275A

Description

[0001]
[Technical field to which the invention belongs]
The present invention relates to a document search system, and more particularly to a document search system in which a search request can be expanded based on co-occurrence word information extracted from useful documents for each field.
[0002]
[Prior art]
Known conventional information retrieval methods include keyword search over full text or index databases, and similar (associative) document retrieval, which compares a search request with the search target documents using a vector space model or a probabilistic model based on the distribution of word occurrences. However, because these methods search using only the information given in the search request (keywords, phrases, etc.) as a clue, they cannot always retrieve documents suited to the purpose efficiently.
[0003]
In view of this, a relevance feedback method has been proposed, in which important words extracted from documents that the user judged useful among the search results are added to the initial search request, as has a pseudo relevance feedback method that does not involve the user's judgment. A method has also been proposed in which related words taken from some knowledge source about the search target are shown to the user, who may selectively add them to the search request.
[0004]
Generally, when a search request is ambiguous or fragmentary, it is very difficult to find the target document among a large number of search results. The feedback methods and the like described above are considered meaningful because they can expand the search request itself in some way so that it better reflects the search purpose.
[0005]
[Problems to be solved by the invention]
However, the relevance feedback method has the problem that it places a heavy judgment burden on the search user. The pseudo relevance feedback method does not involve the user's judgment, but because it automatically adds important words taken from the top-ranked documents of the search result, it has the problem that search precision is not always improved efficiently.
[0006]
In addition, the method that allows selective addition to the search request determines related words by referring to a hierarchical thesaurus or a series of documents, so it has the problem that the grounds for identifying these as an appropriate information source are often ambiguous.
[0007]
Therefore, the conventional information retrieval methods described above have the problem that, when a search request is expanded by, for example, adding appropriate words, either a complicated procedure is required or it is uncertain whether sufficiently accurate information is being referred to.
[0008]
The present invention has been made in view of the above-described prior art, and an object of the present invention is to provide a document search system that supports efficient discovery of matching documents from search results.
[0009]
[Means for Solving the Problems]
In order to achieve the above object, the present invention comprises: a co-occurrence word database in which co-occurrence word tables are stored; a search request expansion unit that refers to the co-occurrence word database in accordance with a search request and expands the search request; a search execution unit that performs a similar document search over search target documents using the search request and the co-occurrence words added by the search request expansion unit; and a search result display unit that displays the documents obtained by the similar document search of the search execution unit. The invention is characterized in that the means for creating the co-occurrence word database comprises: noun extraction means that performs morphological analysis on the documents serving as the basis for co-occurrence word extraction and extracts nouns from them; likelihood calculation means that, for each combination of two extracted nouns, calculates a likelihood representing how plausibly the two words co-occur, using the number of documents in which both words appear, the number of documents in which one word appears and the other does not, the number of documents in which the other word appears and the one does not, and the number of documents in which neither word appears; and means for creating a co-occurrence word table ordered by the likelihood obtained by the likelihood calculation means.
[0011]
According to the above feature, a co-occurrence word can be added to the search request and the search can be executed, so that a matching document can be efficiently found from the search result.
[0012]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, the present invention will be described in detail with reference to the drawings. FIG. 1 is a block diagram showing an embodiment of a document search system of the present invention.
[0013]
As shown in the figure, the document search system is broadly divided into a co-occurrence word information extraction unit 1 and a search processing unit 2. The co-occurrence word information extraction unit 1 comprises a case registration unit 11 that registers documents (cases) serving as the basis for co-occurrence word extraction, a co-occurrence word information extraction unit 12 that extracts co-occurrence word information from the cases registered in the case registration unit 11 by referring to a morphological analysis dictionary 13, and a field-specific co-occurrence word DB (database) 14 that accumulates, for each field, the co-occurrence word information extracted by the co-occurrence word information extraction unit 12. The search processing unit 2 comprises a search request analysis unit 21 to which a search request (search word) is given together with a designated field, a search request expansion unit 22 that expands or supplements the search request by referring to the field-specific co-occurrence word DB 14, a search execution unit 23 that searches search target documents 25, such as documents on the Internet, using the search request expanded by the search request expansion unit 22, and a search result display unit 24.
[0014]
Next, the function of the co-occurrence word information extraction unit 1 will be described in detail with reference to FIGS. 1 and 2. FIG. 2 is a flowchart showing the function of the co-occurrence word information extraction unit 1.
[0015]
First, the principle or concept of co-occurrence word extraction will be described. Consider, for example, searching within the "Science" textbook area for learning information about "Mars", an astronomy-related term. If "Mars" alone is used as the search word, the search intent is ambiguous and, depending on the search scope, a large number of irrelevant documents will be retrieved. If words that co-occur frequently with "Mars" within the group of documents judged usable for learning are added to "Mars", the search results are narrowed down further. By adding frequently co-occurring search words in this way and then executing a similar document search, it is considered possible to achieve both precision and recall to a degree that cannot be obtained with a conventional keyword AND search or OR search. The above is the principle or concept of the co-occurrence word extraction.
[0016]
Accordingly, in step S1 of FIG. 2, the operator registers in the case registration unit 11 documents that have a record of use or that are judged to be usable. The field is preferably registered together with each document. For example, a large number of documents judged usable within the "Science" textbook area are collected and registered in the case registration unit 11 with a designated field such as astronomy, geology, physics, chemistry, animals, plants, or weather.
[0017]
Next, in step S2, the co-occurrence word information extraction unit 12 refers to the morphological analysis dictionary 13, performs morphological analysis on each document for each field, extracts only the nouns from the document, and counts the number of documents (or pages) in which each noun (distinct word) appears. An experimental example is shown in FIG. 3. In FIG. 3, the extracted nouns are listed as representative words in order of frequency.
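The counting in step S2 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the morphological analysis has already been done and each document is given as a list of nouns. The function name `document_frequency` and the toy corpus are hypothetical.

```python
from collections import Counter


def document_frequency(docs_nouns):
    """Count, for each distinct noun, how many documents it appears in.

    docs_nouns: list of noun lists, one per document. The nouns are assumed
    to come from a morphological analyzer, which is not shown here.
    """
    df = Counter()
    for nouns in docs_nouns:
        # set() so that a noun repeated inside one document counts once
        df.update(set(nouns))
    return df


# Hypothetical toy corpus registered under the "astronomy" field
astronomy_docs = [
    ["light year", "distance", "galaxy"],
    ["light year", "distance", "star", "star"],
    ["Mars", "planet", "star"],
]
df = document_frequency(astronomy_docs)
ranked = df.most_common()  # nouns in descending order of document count
```

Counting documents rather than raw occurrences matches the text, which speaks of the number of documents (or pages) in which a noun appears.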
[0018]
Simply selecting words in descending order of co-occurrence frequency, as in step S2, does not reflect how strongly two words co-occur relative to the frequency with which the original word and the co-occurring word each appear on their own. Steps S3 and S4 are therefore performed.
[0019]
That is, in step S3, for each field and for every combination of two words among the extracted nouns (distinct words), the following are calculated: the number of documents in which both words appear, the number of documents in which only one of the words appears (counted separately for each word), and the number of documents in which neither word appears.
[0020]
In general, as shown in FIG. 4, for a combination of two words $w_j$ and $w_k$, let $n_{11}$ be the number of documents in which both $w_j$ and $w_k$ appear, $n_{12}$ the number of documents in which $w_j$ appears and $w_k$ does not, $n_{21}$ the number of documents in which $w_j$ does not appear and $w_k$ does, and $n_{22}$ the number of documents in which neither $w_j$ nor $w_k$ appears.
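The four counts of step S3 can be computed as in the sketch below. It assumes each document has already been reduced to its list of nouns. The names `contingency_counts` and `all_pair_counts` and the toy documents are hypothetical, introduced only for illustration.

```python
from itertools import combinations


def contingency_counts(doc_sets, w_j, w_k):
    """Return (n11, n12, n21, n22) for the word pair (w_j, w_k)."""
    n11 = n12 = n21 = n22 = 0
    for words in doc_sets:
        has_j, has_k = w_j in words, w_k in words
        if has_j and has_k:
            n11 += 1  # both words appear
        elif has_j:
            n12 += 1  # w_j appears, w_k does not
        elif has_k:
            n21 += 1  # w_k appears, w_j does not
        else:
            n22 += 1  # neither word appears
    return n11, n12, n21, n22


def all_pair_counts(docs_nouns):
    """Compute the four counts for every pair of distinct nouns."""
    doc_sets = [set(nouns) for nouns in docs_nouns]
    vocab = sorted(set().union(*doc_sets))
    return {(a, b): contingency_counts(doc_sets, a, b)
            for a, b in combinations(vocab, 2)}


# Hypothetical toy documents, already reduced to nouns
docs = [
    ["light year", "distance", "galaxy"],
    ["light year", "distance", "star"],
    ["Mars", "planet", "star"],
    ["Mars", "distance"],
]
pair_counts = all_pair_counts(docs)
```

The four counts always sum to the number of documents, which is a convenient sanity check. Enumerating all pairs is quadratic in the vocabulary size, so a real corpus would likely need pruning.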
[0021]
Next, in step S4, for each field and for each word, a likelihood is calculated against every other word that co-occurs with it, using the results of step S3, and a co-occurrence word table ordered by likelihood, that is, the co-occurrence word DB, is created.
[0022]
Hereinafter, the process of step S4 will be described in more detail.
Let $D = \{d_1, d_2, \ldots, d_N\}$ be the document set from which co-occurrence words are collected, and let $\{w_1, w_2, \ldots, w_M\}$ be the distinct words in the set. Here, $d_i$ denotes an individual document and $w_j$ an individual distinct word.
[0023]
Then, the co-occurrence frequency C can be expressed as follows.
If $w_j$ and $w_k$ appear together in the same document $d_i$, then $C(d_i, w_j, w_k) = 1$; if they do not, then $C(d_i, w_j, w_k) = 0$.
[0024]
Next, with reference to "Introduction to Statistical Analysis Based on Information Criteria" (Kodansha Scientific, 1995), the co-occurrence frequency $C_{OOC}$, which takes into account the frequency distribution arising from the relative co-occurrence described above, is defined by the following expression. LL is the log likelihood.

Note that $N = n_{11} + n_{12} + n_{21} + n_{22}$.
To exclude negative co-occurrence, the following condition is imposed:
$n_{11} / (n_{11} + n_{21}) > n_{12} / (n_{12} + n_{22})$
Only those that satisfy the above conditions are recognized as positive co-occurrence words.
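The defining expression for $C_{OOC}$ is not reproduced in this text, so the sketch below substitutes a standard log-likelihood statistic for a 2x2 contingency table, $\sum n_{ab}\log n_{ab} - \sum(\text{row totals})\log(\text{row totals}) - \sum(\text{column totals})\log(\text{column totals}) + N\log N$, which is zero when the two words are independent. This choice is an assumption and may differ from the patent's own formula. The positive co-occurrence condition is taken from the text. All function names are hypothetical.

```python
import math


def xlogx(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood(n11, n12, n21, n22):
    """Assumed stand-in for C_OOC: log-likelihood statistic of a 2x2 table."""
    n = n11 + n12 + n21 + n22
    cells = xlogx(n11) + xlogx(n12) + xlogx(n21) + xlogx(n22)
    rows = xlogx(n11 + n12) + xlogx(n21 + n22)
    cols = xlogx(n11 + n21) + xlogx(n12 + n22)
    return cells - rows - cols + xlogx(n)


def is_positive_cooccurrence(n11, n12, n21, n22):
    """Condition from the text: n11/(n11+n21) > n12/(n12+n22).

    Written in cross-multiplied form so that a zero denominator yields
    False instead of raising ZeroDivisionError.
    """
    return n11 * (n12 + n22) > n12 * (n11 + n21)


def cooccurrence_score(n11, n12, n21, n22):
    """Return the score for a positive pair, or None for a rejected pair."""
    if not is_positive_cooccurrence(n11, n12, n21, n22):
        return None
    return log_likelihood(n11, n12, n21, n22)
```

The filter is needed because a statistic of this kind is large for any strong association, including pairs of words that avoid each other. The condition keeps only pairs where $w_j$ is more likely to appear in documents containing $w_k$ than in documents without it.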
[0025]
The co-occurrence words of a given word are ordered by the log likelihood defined above. This value is called the degree of co-occurrence.
An example of the co-occurrence word table obtained in step S4 is shown in FIG. 5. FIG. 5 shows the likelihoods of the words that co-occur with the word "light year". Such co-occurrence word tables are stored in the field-specific co-occurrence word DB 14 of FIG. 1.
[0026]
Next, the operation of the search processing unit 2 will be described with reference to FIG. 1 and the flowchart of FIG. 6.
In step S11, a search request (text such as a search word or a sentence fragment) is accepted. The search request analysis unit 21 identifies the input search word by referring to the morphological analysis dictionary 13. In step S12, the search target field is designated. That is, the system receives from the user a selection of the field that matches the search purpose. The identified search word and the designated search target field are sent to the search request expansion unit 22, which determines the co-occurrence word DB to be referred to.
[0027]
In step S13, the search request expansion unit 22 refers to the field-specific co-occurrence word DB 14 and expands the search request (adds co-occurrence words) or supplements it. That is, it refers to the co-occurrence word table in the co-occurrence word DB that corresponds to the search word of step S11 and adds one or more co-occurrence words in descending order of likelihood. For example, for the search word "light year", the words "distance", "galaxy", "star", and so on are added in that order. The inventors' experiments showed that adding two or three words is suitable, but the invention is not limited to this.
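A minimal sketch of the expansion in step S13 follows. It assumes the field-specific co-occurrence word DB is held as a nested dictionary whose tables are already sorted by likelihood. The layout, the function name `expand_query`, the fourth word "universe", and all likelihood values are hypothetical. Only the order "distance", "galaxy", "star" comes from the text.

```python
def expand_query(query_words, field, cooc_db, k=2):
    """Append up to k top-ranked co-occurrence words for each query word.

    cooc_db: {field: {word: [(co-occurring word, likelihood), ...]}}, where
    each list is assumed to be sorted in descending order of likelihood.
    """
    table = cooc_db.get(field, {})
    expanded = list(query_words)
    for word in query_words:
        for candidate, _likelihood in table.get(word, [])[:k]:
            if candidate not in expanded:  # avoid adding a word twice
                expanded.append(candidate)
    return expanded


# Hypothetical table for the "astronomy" field (values are made up)
cooc_db = {
    "astronomy": {
        "light year": [("distance", 35.2), ("galaxy", 28.9),
                       ("star", 21.4), ("universe", 15.0)],
    }
}
expanded = expand_query(["light year"], "astronomy", cooc_db, k=3)
```

The default `k=2` reflects the observation in the text that adding two or three words worked well. A query word with no table entry is left unexpanded.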
[0028]
Next, in step S14, the search execution unit 23 executes the document search. That is, taking the search request to which co-occurrence words were added in step S13 as a new search request, the search execution unit 23 executes a similar document search over the search target documents 25, for example documents on the Internet. The search results are output in order of similarity.
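The text does not specify which similarity model the search execution unit 23 uses. As one possible illustration, the sketch below ranks documents by cosine similarity between raw term-frequency vectors, a simple vector space model. That choice, the function names, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_document_search(query_terms, docs):
    """Return (similarity, doc_id) pairs in descending order of similarity.

    docs: {doc_id: list of terms}. Ties are broken by doc_id.
    """
    q = Counter(query_terms)
    scored = [(cosine(q, Counter(terms)), doc_id)
              for doc_id, terms in docs.items()]
    return sorted(scored, key=lambda s: (-s[0], s[1]))


# Hypothetical target documents and an already expanded search request
target_docs = {
    "d1": ["light year", "distance", "galaxy", "star"],
    "d2": ["light year", "calendar"],
    "d3": ["Mars", "planet"],
}
results = similar_document_search(["light year", "distance", "galaxy"],
                                  target_docs)
```

Unlike a keyword AND search, a document that lacks some of the added words still receives a nonzero score, and unlike an OR search, documents that match more of the words rank higher.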
[0029]
Finally, in step S15, the search result display unit 24 displays the search result. In displaying the search results, a list of search results is displayed in order of similarity to the search request, and the target document is selected and displayed.
[0030]
FIG. 7 and FIG. 8 show the results of totaling the results of experiments conducted by the inventor for 21 search terms and representing the relationship between the recall and the search accuracy in a graph. FIG. 7 shows a search result when co-occurrence word information is used by limiting the learning field, and FIG. 8 shows a search result when co-occurrence word information of the entire learning field is used.
[0031]
The recall on the horizontal axis of the graphs is defined as follows. When, for example, A documents (A is a positive integer) are displayed as a list in order of similarity in the search result display unit 24, recall 0.0 means the document with the highest similarity, recall 0.1 means the documents from the top down to the (0.1 × A)-th, recall 0.2 means the documents from the top down to the (0.2 × A)-th, ..., and recall 1.0 means the documents from the top down to the (1.0 × A)-th.
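The cutoffs defined above can be turned into a precision curve as sketched below. The text defines only which documents each recall level covers. Measuring precision as the share of relevant documents within each cutoff, and rounding the cutoff to the nearest integer, are assumptions. The function name and the relevance judgments are hypothetical.

```python
def precision_at_levels(ranked_relevance, levels=(0.0, 0.1, 0.2, 0.5, 1.0)):
    """Precision within the top (level x A) documents of a ranked list.

    ranked_relevance: list of booleans in similarity order, True meaning
    the document was judged relevant. Level 0.0 is the single top-ranked
    document, following the definition in the text.
    """
    a = len(ranked_relevance)
    if a == 0:
        raise ValueError("the ranked list must not be empty")
    result = {}
    for level in levels:
        cutoff = max(1, round(level * a))  # rounding rule is an assumption
        top = ranked_relevance[:cutoff]
        result[level] = sum(top) / len(top)
    return result


# Hypothetical relevance judgments for A = 10 ranked documents
judged = [True, True, False, True, False, False, True, False, False, False]
curve = precision_at_levels(judged)
```

Note that this "recall" is a position in the ranked list rather than the usual fraction of all relevant documents retrieved, so curves built this way are not directly comparable to standard recall-precision graphs.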
[0032]
FIGS. 7 and 8 show that, at recall 0.0 to 0.1, good search precision is obtained when two or three co-occurrence words are added to the search word and a similar document search is performed.
[0033]
Further, as shown in FIG. 1, documents that serve the purpose can be selected as appropriate from the documents presented in the search result display unit 24 and fed back to the case registration unit 11. By then performing the co-occurrence word information extraction operation described with reference to FIG. 2, the contents of the field-specific co-occurrence word DB 14 can be updated with higher accuracy.
[0034]
[Effect of the invention]
As is clear from the above description, according to the present invention the search request is expanded by adding co-occurrence words to it, so that relevant documents can be found efficiently among the search results. In addition, accurate information corresponding to the search request can be referred to when the co-occurrence words are added. As a result, the efficiency of searching for learning information at junior high schools, high schools, universities, and the like, and of searching for research information by researchers in each field, can be greatly improved.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an embodiment of the present invention.
FIG. 2 is a flowchart of co-occurrence word information extraction processing according to the present invention.
FIG. 3 is a diagram showing an experiment example of a field, the number of collected pages, and extracted nouns.
FIG. 4 is a diagram in which appearance modes of co-occurrence words are divided.
FIG. 5 is a diagram showing a specific example of a co-occurrence word table.
FIG. 6 is a flowchart showing the search execution operation.
FIG. 7 is a diagram illustrating an experiment example of a search result when co-occurrence word information is used by limiting a learning field.
FIG. 8 is a diagram showing an experimental example of a search result when co-occurrence word information of the entire learning field is used.
[Explanation of symbols]
1 ... Co-occurrence word information extraction unit, 2 ... Search processing unit, 11 ... Case registration unit, 12 ... Co-occurrence word information extraction unit, 13 ... Morphological analysis dictionary, 14 ... Field-specific co-occurrence word DB, 21 ... Search request analysis unit, 22 ... Search request expansion unit, 23 ... Search execution unit, 24 ... Search result display unit, 25 ... Search target documents.

Claims

A co-occurrence word database storing a co-occurrence word table;
A search request expansion unit that refers to the co-occurrence word database in accordance with a search request and expands the search request;
A search execution unit that performs a similar document search over search target documents using the search request and the co-occurrence words added by the search request expansion unit;
A search result display unit for displaying a document obtained by the similar document search of the search execution unit,
The means for creating the co-occurrence word database is:
A noun extraction means for performing morphological analysis on a document on which a co-occurrence word is extracted and extracting a noun from the document;
A likelihood calculating means that, for each combination of two extracted nouns, calculates a likelihood representing how plausibly the two words co-occur, using the number of documents in which the two words appear together, the number of documents in which one of the two words appears and the other word does not appear, the number of documents in which the other word appears and the one word does not appear, and the number of documents in which neither of the two words appears; and
A means for creating a co-occurrence word table ordered by the likelihood obtained by the likelihood calculating means.

The document search system according to claim 1,
wherein the co-occurrence word database includes a co-occurrence word table for each field.

The document search system according to claim 1 or 2,
wherein the search request expansion unit extracts the co-occurrence words corresponding to the search request from the co-occurrence word database in descending order of likelihood and adds them to the search request.

The document search system according to any one of claims 1 to 3,
wherein the search result display unit displays a list of search results in order of similarity to the search request.