JP2011192029A

JP2011192029A - Information retrieval device, method and program

Info

Publication number: JP2011192029A
Application number: JP2010057747A
Authority: JP
Inventors: Naoki Fujita; 尚樹藤田; Yamato Takahashi; 大和高橋; Shunsuke Konagai; 俊介小長井; Ryoji Kataoka; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-03-15
Filing date: 2010-03-15
Publication date: 2011-09-29

Abstract

PROBLEM TO BE SOLVED: To output high-precision search results that take synonyms of a search word into consideration, without causing a delay in the display of the search results during execution of a search.

SOLUTION: A search range is analyzed, and, for each word, information about the frequency of occurrence or the like within each electronic document (hereinafter referred to as word frequency information) and information about a synonym ID including the word are combined to create a word index collected as a record of the words. Pieces of word frequency information for the words in the word index are collected for each synonym ID to create a synonym index. The synonym index is referred to while the synonym ID obtained by referring to the word index while using the search word as a key is used as a key. Using the word frequency information obtained, the degree of match with the search word is calculated.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to an information retrieval device, method, and program, and in particular to an information retrieval device, method, and program such as a search engine on the Internet.

In recent years, with the spread of the Internet, an enormous number of electronic documents have come to exist online, and search systems and services that let users accurately find the information they need have become increasingly important. A typical search service determines the output order of search results by combining two scores. The first is the degree of match between the search word and a document, based on how many times the search word entered by the user appears in the document being searched or in the anchor text of links pointing to that document from other documents. The second is the importance of the document, based on measures such as how often the document is referenced by other documents.

Methods based on word statistics, such as BM25 and tf-idf, are used to compute the degree of match with a document (see, for example, Non-Patent Document 1). These methods assume that words appearing in a document more frequently than their average over a given document collection characterize that document, and they give a higher output rank to documents whose characteristics match the user's search word more closely.

With this approach, good search results are obtained when the search word is relatively rare, but when it is a very common word, too many documents end up with similar degrees of match. When many documents match the search word to a similar degree, typical information retrieval services use the importance of each document to rank them and determine the output order of the search results.

Techniques such as PageRank (see, for example, Non-Patent Document 2) and HITS (see, for example, Non-Patent Document 3) are commonly used to compute document importance. These techniques use the link information of Web pages and rest on the assumption that a document linked to from many other documents is likely to be important.

In addition, for a user of an Internet search service, Web pages that contain synonyms of the search word may match the user's intent just as well as Web pages that contain the search word itself. This is because several words can denote what the user is looking for, and different Web pages may use different words (for example, "Akihabara" and "Akiba", or "sale" and "bargain"). Different words that denote the same thing or matter are called synonyms or near-synonyms (hereinafter referred to collectively as "synonyms"), and many of them have been compiled into synonym dictionaries.

One way to return pages that do not contain the search word entered by the user but do contain its synonyms, and that therefore match the user's search intent, is for the search system to add synonyms to the search word, or substitute them for it, either automatically or through interaction with the user, and then search with the result (see, for example, Patent Document 1).

Patent Document 1: JP 2004-164662 A

Non-Patent Document 1: S. Robertson, H. Zaragoza, M. Taylor, "Simple BM25 extension to multiple weighted fields", Proceedings of the thirteenth ACM international conference on Information and knowledge management, 2004.
Non-Patent Document 2: Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", 7th International World Wide Web conference (WWW98).
Non-Patent Document 3: Jon M. Kleinberg, "Authoritative sources in a hyperlinked environment", Journal of the ACM (JACM), v.46 n.5, p.604-632, Sept. 1999.

However, conventional search methods that take synonyms into account in this way have the following problems.

When a search word has many synonyms and the search system adds them to the query, the search is performed with many words, and the processing required to calculate the degree of match and the display order of the search results becomes heavy.

A search system normally maintains an index in which the Web pages to be searched are split into arbitrary units (hereinafter "words"), and it looks up matching Web pages in the index using the search word as a key. Consequently, when many synonyms are added to the search word, the number of lookups in the document index increases. Furthermore, the degree of match between a page and a search word is calculated for each combination; when there are multiple search words, the degrees of match between each search word and the page are summed with arbitrary or equal weights, or combined by some arbitrary function. As the number of search words grows, so does the computation needed to calculate the degree of match between each search word and each page. Collating the per-page degrees of match calculated for each search word and synonym and aggregating them page by page also takes time. The response time to the user therefore becomes longer and convenience suffers. Limiting the number of synonyms lets results be displayed quickly, but documents containing the excluded synonyms are not retrieved, which makes it more likely that the search results will fail to satisfy the user.
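The cost described above can be illustrated with a minimal sketch of the conventional approach, in which each synonym is looked up separately and the per-word results are then merged per document. The toy index, the synonym dictionary, and the function name `expanded_search` are hypothetical and only serve as an illustration; raw occurrence counts stand in for a real degree-of-match formula.

```python
from collections import defaultdict

# Hypothetical toy index: word -> {document number: occurrence count}
word_index = {
    "special sale": {10: 1, 11: 1},
    "sale": {11: 1},
}

# Hypothetical synonym dictionary used for query expansion
synonyms = {"special sale": ["sale"]}


def expanded_search(search_word):
    """Look up the search word and every synonym separately, then merge per document."""
    lookups = 0
    merged = defaultdict(float)
    for word in [search_word] + synonyms.get(search_word, []):
        lookups += 1  # one index reference per query word
        for doc_no, tf in word_index.get(word, {}).items():
            merged[doc_no] += tf  # collate the per-word scores by document
    return dict(merged), lookups


scores, lookups = expanded_search("special sale")
print(scores, lookups)
```

In this sketch the number of index references and the amount of merging both grow with the number of synonyms, which is the behavior the text identifies as the problem.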

The present invention solves the above problems. Its object is to provide an information retrieval device, method, and program that can output highly accurate search results that take synonyms of the search word into account, without delaying the display of the search results when a search is executed.

FIG. 1 is a diagram showing the principle configuration of the present invention.

The present invention (claim 1) is an information retrieval device that, when searching a group of electronic documents using a search word (in word units) specified from a user terminal, calculates the degree of match with the search word by referring to an index in which the electronic documents forming the search range are stored split into arbitrary units (hereinafter referred to as "words"), the device comprising:
synonym table storage means 101 storing a synonym table holding information (hereinafter referred to as "synonym IDs") in which words that are synonyms of one another are grouped in advance and each group is assigned an ID;
index creation means 110 for analyzing the search range, creating a word index in which, for each word, information such as its frequency of occurrence in each electronic document (hereinafter referred to as "word frequency information") and the synonym ID of the group containing the word are combined into a record for that word, and storing the word index in word index storage means 102;
synonym index creation means 120 for creating a synonym index in which the per-word word frequency information in the word index of the word index storage means 102 is aggregated for each synonym ID, and storing the synonym index in synonym index storage means 103; and
search word matching degree calculation means 130 for calculating the degree of match with the search word using the word frequency information obtained by referring to the synonym index storage means 103 with, as a key, the synonym ID obtained by referring to the word index storage means 102 with the search word as a key.

Further, in the present invention (claim 2), when calculating the degree of match with the search word, the search word matching degree calculation means 130 refers to the word index storage means using the search word as a key and acquires the word frequency information of the word in addition to its synonym ID. It then uses this information, together with the word frequency information for the synonym ID obtained by referring to the synonym index storage means, to calculate the degree of match.

Further, in the present invention (claim 3), when aggregating the word frequency information from the information in the word index storage means, the synonym index creation means 120 selects which information to keep according to arbitrarily set conditions and stores the result in the synonym index storage means as the synonym index.

FIG. 2 is a diagram for explaining the principle of the present invention.

The present invention (claim 4) is an information retrieval method that, when searching a group of electronic documents using a search word (in word units) specified from a user terminal, calculates the degree of match with the search word by referring to an index in which the electronic documents forming the search range are stored split into arbitrary units (hereinafter referred to as "words"), wherein a device having
synonym table storage means storing a synonym table holding information (hereinafter referred to as "synonym IDs") in which words that are synonyms of one another are grouped in advance and each group is assigned an ID,
word index storage means for storing a word index, and
synonym index storage means for storing a synonym index
performs:
an index creation step (step 1) of analyzing the search range, creating a word index in which, for each word, information such as its frequency of occurrence in each electronic document (hereinafter referred to as "word frequency information") and the synonym ID of the group containing the word are combined into a record for that word, and storing the word index in the word index storage means;
a synonym index creation step (step 2) of creating a synonym index in which the per-word word frequency information in the word index of the word index storage means is aggregated for each synonym ID, and storing the synonym index in the synonym index storage means; and
a search word matching degree calculation step (step 3) of calculating the degree of match with the search word using the word frequency information obtained by referring to the synonym index storage means with, as a key, the synonym ID obtained by referring to the word index storage means with the search word as a key.

Further, in the present invention (claim 5), when the degree of match with the search word is calculated in the search word matching degree calculation step (step 3), the word index storage means is referred to using the search word as a key, and the word frequency information of the word is acquired in addition to its synonym ID. This information is then used, together with the word frequency information for the synonym ID obtained by referring to the synonym index storage means, to calculate the degree of match.

Further, in the present invention (claim 6), when the word frequency information is aggregated from the information in the word index storage means in the synonym index creation step (step 2), the information to keep is selected according to arbitrarily set conditions, and the result is stored in the synonym index storage means as the synonym index.

The present invention (claim 7) is an information retrieval program for causing a computer to function as each of the means constituting the information retrieval device according to any one of claims 1 to 3.

As described above, the present invention provides the following effects.

(1) Even when the search word has many synonyms, only two index references are needed: one reference to the word index using the search word, and one reference to the synonym index using the synonym ID obtained there. When the search word has two or more synonyms, the number of index references is reduced, and the time required for the search can be shortened.

(2) Even when the search word has many synonyms, the degree of match of each page with the search word is a single value calculated from the word frequency information in the synonym index. The collation of degrees of match calculated separately for each search word and synonym, which the conventional technique required, becomes unnecessary, and the time required for the search can be shortened.

(3) According to the inventions of claims 2 and 5, electronic documents that contain the search word itself are given priority and ranked higher in the search results, so that search results with higher user satisfaction can be provided.

(4) According to the inventions of claims 3 and 6, since electronic documents with low word frequency tend to rank low in the search results, narrowing down the word frequency information with emphasis on the top of the search results reduces the overall amount of computation and allows search results to be provided to the user more quickly.
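Effect (4) can be pictured with a small sketch of narrowing down the aggregated word frequency information. The text only says that the selection conditions are set arbitrarily, so the minimum-frequency threshold used here, the parameter name `min_tf`, and the function name `prune_postings` are assumptions chosen purely for illustration.

```python
def prune_postings(postings, min_tf=2):
    """Keep only (document number, frequency) entries whose frequency is at least min_tf.

    The threshold rule is a hypothetical example of an "arbitrarily set condition".
    """
    return {doc_no: tf for doc_no, tf in postings.items() if tf >= min_tf}


# Aggregated word frequency information for one synonym ID
aggregated = {10: 1, 11: 2}
print(prune_postings(aggregated))
```

Fewer entries in the synonym index mean fewer degree-of-match calculations at search time, at the cost of dropping documents that would most likely have ranked low anyway.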

FIG. 1 is a diagram showing the principle configuration of the present invention.
FIG. 2 is a diagram for explaining the principle of the present invention.
FIG. 3 is a configuration diagram of the information retrieval system according to the first embodiment of the present invention.
FIG. 4 is a configuration diagram of the information retrieval system according to the second embodiment of the present invention.
FIG. 5 is a diagram showing a conventional search method.
FIG. 6 is a diagram comparing the method of the present invention with the conventional method.

Embodiments of the present invention will be described below with reference to the drawings.
The information retrieval device of the present invention searches electronic documents. In addition to checking whether the search word is present in the electronic documents (hereinafter referred to as the "search range"), it takes synonyms into account to calculate the degree of match between the search word and each electronic document, and it outputs search results based on that degree of match.

[First Embodiment]
FIG. 3 shows the configuration of the information retrieval system according to the first embodiment of the present invention.

The information retrieval system shown in FIG. 3 comprises an information retrieval device 100, a search target document storage device 200, and an information retrieval terminal 300.

The information retrieval device 100 comprises an index function unit 110, a synonym index function unit 120, a search word matching degree calculation unit 130, an overall ranking calculation unit 140, a synonym table storage unit 101, a word index storage unit 102, and a synonym index storage unit 103.

The index function unit 110 is connected to the external search target document storage device 200, and the search word matching degree calculation unit 130 and the overall ranking calculation unit 140 are connected to the information retrieval terminal 300. Data can be sent and received over a network such as the Internet.

The information retrieval terminal 300 is a personal computer or similar device (including mobile phones and PDAs). It sends a search request entered by the user (including the search word) to the search word matching degree calculation unit 130 of the information retrieval device 100 and receives the search results for that request from the overall ranking calculation unit 140.

The information retrieval device 100 performs two kinds of processing. In pre-processing, a word index and a synonym index for full-text search are created in advance and stored in the word index storage unit 102 and the synonym index storage unit 103, respectively. In search processing, search results are produced by referring to the word index storage unit 102 and the synonym index storage unit 103 according to the search word sent from the information retrieval terminal 300 with a search request.

The information retrieval device 100 constitutes a search engine system that searches a group of electronic documents on the Internet or elsewhere, and has the hardware resources of an ordinary computer (CPU, memory, HDD, various interfaces, and so on).

Using these hardware resources and the software running on them (OS, applications, and so on), the information retrieval device 100 implements the following: the index function unit 110, which creates a word index from the electronic documents to be searched in the search target document storage device 200 and the synonym information in the synonym table storage unit 101 and stores it in the word index storage unit 102; the synonym index function unit 120, which creates a synonym index from the information in the word index storage unit 102 and the synonym table storage unit 101 and stores it in the synonym index storage unit 103; the search word matching degree calculation unit 130, which refers to the word index storage unit 102 and the synonym index storage unit 103 based on the search word sent from the information retrieval terminal 300 and calculates the degree of match between the search word and each electronic document to be searched; and the overall ranking calculation unit 140, which determines the output order of the search results returned to the information retrieval terminal 300 based on the degree of match calculated by the search word matching degree calculation unit 130.

Pre-processing is carried out by the index function unit 110 and the synonym index function unit 120, and search processing by the search word matching degree calculation unit 130 and the overall ranking calculation unit 140. Data is exchanged between the information retrieval device 100 and the information retrieval terminal 300 through a communication interface included in the hardware resources, and the synonym table storage unit 101, the word index storage unit 102, and the synonym index storage unit 103 are built on a hard disk drive included in the hardware resources.

The synonym information recorded in the synonym table storage unit 101 can be obtained from existing thesauri and dictionaries. The processing performed by the index function unit 110, the synonym index function unit 120, the search word matching degree calculation unit 130, and the overall ranking calculation unit 140 is described below using an example in which electronic documents 10 and 11 in the search target document storage device 200 are the search targets.

<Pre-processing>
・Index function unit 110
The index function unit 110 splits the information in the documents to be searched into units for full-text search, such as words, n-grams, or suffix array entries (hereinafter "words"), and creates a word index. For each word, it stores in the word index storage unit 102 the word frequency information (a set of (electronic document number, word frequency) pairs) and, if the word belongs to one of the synonym groups in the synonym table in the synonym table storage unit 101, the corresponding synonym ID. The unit of splitting is arbitrary, and methods other than those above may be used.
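A minimal sketch of how such a word index could be built is shown below. It assumes the documents have already been split into words, and the record layout (`postings`, `synonym_id`), the function name `build_word_index`, and the contents of the synonym table are illustrative assumptions rather than the format used by the device.

```python
from collections import Counter

# Hypothetical synonym table: word -> synonym ID
synonym_table = {"special sale": "002", "sale": "002"}


def build_word_index(documents):
    """Build word -> record from {document number: list of words}.

    Each record holds the word frequency information as
    {document number: frequency} and the word's synonym ID (None if absent).
    """
    index = {}
    for doc_no, words in documents.items():
        for word, tf in Counter(words).items():
            record = index.setdefault(
                word,
                {"postings": {}, "synonym_id": synonym_table.get(word)},
            )
            record["postings"][doc_no] = tf
    return index


# Documents 10 and 11, already split into words (tokenization is out of scope here)
documents = {
    10: ["special sale", "today"],
    11: ["special sale", "sale", "today"],
}
word_index = build_word_index(documents)
print(word_index["special sale"])
```

Storing the synonym ID directly in the word's record is what later allows a single word index lookup to yield the key for the synonym index.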

Here, as an example of the word index, the word index storage unit 102 holds index entries for electronic documents 10 and 11 of the search target document storage device 200. Besides the word frequency information, this document index may contain other information used in ordinary full-text search indexes. Per-document information that can be added to the corresponding information for other words, such as statistics on the HTML markup of a word, may also be included in the word frequency information.

・Synonym index function unit 120
The synonym index function unit 120 creates a synonym index by aggregating the word frequency information of the word index in the word index storage unit 102 for each synonym ID, and stores it in the synonym index storage unit 103. When entries share the same electronic document number, their word frequencies are added.

As an example, consider "special sale" and "sale", both of which have synonym ID "002". The word index holds "(10 (electronic document number), 1 (frequency of occurrence)) (11 (electronic document number), 1 (frequency of occurrence))" as the word frequency information for "special sale", and "(11 (electronic document number), 1 (frequency of occurrence))" as the word frequency information for "sale". Aggregating these into the synonym index gives the word frequency information "(10, 1) (11, 2)" for synonym ID "002", which is stored in the synonym index storage unit 103.
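The aggregation in this example can be reproduced with a short sketch. The record layout and the function name `build_synonym_index` are assumptions made for illustration; only the numbers come from the example above.

```python
from collections import defaultdict

# Word index entries from the example (record layout is hypothetical)
word_index = {
    "special sale": {"synonym_id": "002", "postings": {10: 1, 11: 1}},
    "sale": {"synonym_id": "002", "postings": {11: 1}},
}


def build_synonym_index(word_index):
    """Aggregate word frequency information per synonym ID.

    Frequencies for the same document number are added together.
    """
    synonym_index = defaultdict(lambda: defaultdict(int))
    for record in word_index.values():
        synonym_id = record["synonym_id"]
        if synonym_id is None:
            continue  # words outside every synonym group are not aggregated
        for doc_no, tf in record["postings"].items():
            synonym_index[synonym_id][doc_no] += tf
    return {sid: dict(postings) for sid, postings in synonym_index.items()}


print(build_synonym_index(word_index))  # {'002': {10: 1, 11: 2}}
```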

<Search processing>
・Search word matching degree calculation unit 130
On receiving a search request specifying a search word from the information retrieval terminal 300, the search word matching degree calculation unit 130 refers to the word index storage unit 102 using the search word as a key. If the search word has no synonym ID, the unit calculates the degree of match using the word frequency information in the word index. If it has a synonym ID, the unit acquires that ID, refers to the synonym index storage unit 103 using it as a key, and calculates the degree of match using the word frequency information obtained. The degree of match is calculated by a method such as BM25, BM25F, or tf-idf (see Non-Patent Document 1).
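The two-step lookup described here can be sketched as follows. The scoring uses a plain tf-idf variant (frequency multiplied by the logarithm of the inverse document frequency) as a stand-in for whichever formula is actually used; the collection size `TOTAL_DOCS`, the record layout, the word "today", and the function name `match_scores` are all assumptions for illustration.

```python
import math

TOTAL_DOCS = 100  # hypothetical number of documents in the search range

# Hypothetical index contents following the earlier example
word_index = {
    "special sale": {"synonym_id": "002", "postings": {10: 1, 11: 1}},
    "sale": {"synonym_id": "002", "postings": {11: 1}},
    "today": {"synonym_id": None, "postings": {10: 3}},
}
synonym_index = {"002": {10: 1, 11: 2}}


def match_scores(search_word):
    """Return {document number: degree of match} for a single search word."""
    record = word_index.get(search_word)  # first reference: word index
    if record is None:
        return {}
    synonym_id = record["synonym_id"]
    if synonym_id is None:
        postings = record["postings"]  # no synonyms: use the word's own frequencies
    else:
        postings = synonym_index[synonym_id]  # second reference: synonym index
    idf = math.log(TOTAL_DOCS / len(postings))
    return {doc_no: tf * idf for doc_no, tf in postings.items()}


print(match_scores("sale"))
```

Note that a search for "sale" scores document 10 even though that document contains only "special sale", and that no per-synonym merging takes place at search time.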

・Overall ranking calculation unit 140
The overall ranking calculation unit 140 determines the output order of the search results based on the degree-of-match information transferred from the search word matching degree calculation unit 130. The search results are sent to the information retrieval terminal 300 in the output order determined here.

[Second Embodiment]
The second embodiment modifies the first embodiment as follows. FIG. 4 shows the information retrieval system according to the second embodiment.

In FIG. 4, a document importance table storage unit 104, which has document number and importance fields, is added to the configuration of the first embodiment, and the processing of the overall ranking calculation unit 140 differs accordingly. Components in FIG. 4 that are the same as in FIG. 3 are given the same reference numerals, and their description is omitted.

・Overall ranking calculation unit 140
The overall ranking calculation unit 140 refers to the document importance table storage unit 104, which stores the importance of each document number, and obtains a list of the importance of each document. It then determines the output order of the search results by combining the importance of each electronic document in the list with the degree of match transferred from the search word matching degree calculation unit 130.
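The text does not specify how importance and degree of match are combined, so the following sketch simply assumes a weighted sum. The `weight` parameter, the importance values, and the function name `rank` are hypothetical.

```python
# Hypothetical document importance table: document number -> importance
importance_table = {10: 0.8, 11: 0.3}


def rank(match, importance, weight=0.5):
    """Return document numbers ordered by a weighted sum of match and importance.

    The linear combination is an assumption; any other aggregation could be used.
    """
    combined = {
        doc_no: (1 - weight) * score + weight * importance.get(doc_no, 0.0)
        for doc_no, score in match.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


match = {10: 1.0, 11: 2.0}  # degrees of match from the matching degree calculation
print(rank(match, importance_table))
```

With these example values, document 11 ranks first when both factors are weighted equally, while setting `weight=1.0` orders the documents by importance alone.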

The importance of each electronic document recorded in the document importance table storage unit 104 can be calculated by methods such as those described in Non-Patent Documents 2 and 3.

[Third Embodiment]
This embodiment corresponds to claims 2 and 5.

In this embodiment, the processing of the search word matching degree calculation unit 130 differs from that of the first and second embodiments described above. For functions other than the search word matching degree calculation unit 130, either the first or the second embodiment may be applied.

The word frequency information from the synonym index storage unit 103 and the word frequency information from the word index storage unit 102 are added at an arbitrary ratio as follows.

When a search is performed with the word "special sale", the word frequency information from the word index storage unit 102 is
(electronic document 10, electronic document 11) = (t_t10 = 1, t_t11 = 1)
and the word frequency information obtained from the synonym index storage unit 103 using the synonym ID is
(electronic document 10, electronic document 11) = (t_d10 = 1, t_d11 = 2).
When the word frequency information of the synonym index storage unit 103 and that of the word index storage unit 102 are added at a ratio of 1 : α, the result is
(electronic document 10, electronic document 11) = (t_d10 + (α × t_t10), t_d11 + (α × t_t11)).
However, because (t_d10, t_d11) already includes (t_t10, t_t11), when α = 1 "special sale" is given twice the weight of its other synonyms such as "sale".
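
The weighted addition above can be sketched in Python as follows. The function name `blend_frequencies` and the dictionary layout (document number mapped to frequency) are assumptions for illustration; the numeric values are the ones given in the example for "special sale".

```python
def blend_frequencies(synonym_freq, word_freq, alpha):
    """Add synonym-index and word-index frequencies at a ratio of 1 : alpha."""
    blended = {}
    for doc, t_d in synonym_freq.items():
        # A document that lacks the search word itself contributes t_t = 0.
        t_t = word_freq.get(doc, 0)
        blended[doc] = t_d + alpha * t_t
    return blended


word_freq = {10: 1, 11: 1}      # t_t10, t_t11 (word index, "special sale")
synonym_freq = {10: 1, 11: 2}   # t_d10, t_d11 (synonym index, same synonym ID)
blended = blend_frequencies(synonym_freq, word_freq, alpha=1)
```

With `alpha=0` the result equals the synonym index values alone, which corresponds to the behavior of the first embodiment; larger values of `alpha` favor documents that contain the literal search word.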

[Fourth Embodiment]
This embodiment corresponds to claims 3 and 6.

In this embodiment, the processing of the synonym index function unit 120 differs from that of the embodiments described above. For the other functions, any of the first to third embodiments may be applied.

When the word frequency information is aggregated for each synonym ID from the word index in the word index storage unit 102, the information is filtered using an arbitrarily set condition. For example, if the condition is
Condition: "Exclude entries whose word frequency within a document is 1"
then aggregating the synonym ID "001" in FIG. 3 gives
"Akiba: (10, 2)", "Akihabara: (11, 1)" ⇒ "001: (10, 2)"
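
The filtered aggregation can be sketched in Python as follows. The data layout, the function name `build_synonym_index`, and the `keep` predicate are hypothetical; the sketch also assumes that the condition is applied to each word's entry before the frequencies are summed, which matches the example above but is not stated explicitly in the text.

```python
from collections import defaultdict


def build_synonym_index(word_index, keep=lambda freq: True):
    """Aggregate word frequency information per synonym ID.

    word_index: {word: (synonym ID, [(document number, frequency), ...])}
    keep:       predicate on a frequency; entries failing it are dropped
    """
    totals = defaultdict(lambda: defaultdict(int))
    for synonym_id, postings in word_index.values():
        for doc, freq in postings:
            if keep(freq):
                totals[synonym_id][doc] += freq
    # Convert to plain, sorted lists of (document number, frequency).
    return {sid: sorted(docs.items()) for sid, docs in totals.items()}


word_index = {
    "Akiba": ("001", [(10, 2)]),
    "Akihabara": ("001", [(11, 1)]),
}
# Condition: exclude entries whose word frequency within a document is 1.
filtered = build_synonym_index(word_index, keep=lambda freq: freq != 1)
unfiltered = build_synonym_index(word_index)
```

Here `filtered` holds only the entry (10, 2) for synonym ID "001", while `unfiltered` also retains (11, 1).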

The following compares the conventional query expansion method, which adds synonyms to the query (FIG. 5), with the method of the present invention (FIG. 6).

When the search word "sale" is given from the information search terminal 300, in the conventional method, as shown in FIG. 5, the search engine refers to the synonym table, expands the search words (for example, with "special sale" and "bargain"), searches the word index, and obtains the word frequency information corresponding to each search word. That is, using the three search words "sale" and its synonyms "special sale" and "bargain", it refers to the word index for each search word to obtain the respective word frequency information, calculates the degree of match for each word (three times), merges the results so that each document has a single value, and ranks and outputs the documents based on the degree of match.

In contrast, in the method of the present invention, as shown in FIG. 6, when the search word "sale" is given from the information search terminal 300, the word index in the word index storage unit 102 is referred to with the search word "sale" to obtain a synonym ID. The synonym index in the synonym index storage unit 103 is then referred to with that synonym ID, the word frequency information corresponding to the synonym ID is acquired, the degree of match is calculated for each word, and the documents are ranked based on that degree of match without any merge processing.

In the conventional method, the word index is referred to three times, once for each of the three words. In the present invention, the word index is referred to once and the synonym index once, for a total of two lookups. Furthermore, because the word frequency derived from the synonym index is a single value corresponding to one synonym ID, no merging of degrees of match is required.
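
The difference in lookup counts can be illustrated with the following Python sketch. The tables, the synonym ID "002", the frequencies for "sale" and "bargain", and the function names are all made up for illustration, and the raw frequency stands in for the degree of match, whose formula is not given here.

```python
SYNONYM_TABLE = {"002": ["sale", "special sale", "bargain"]}
WORD_INDEX = {  # word -> (synonym ID, {document number: frequency})
    "sale": ("002", {11: 1}),
    "special sale": ("002", {10: 1, 11: 1}),
    "bargain": ("002", {12: 3}),
}
SYNONYM_INDEX = {"002": {10: 1, 11: 2, 12: 3}}  # aggregated in advance


def conventional_search(query):
    """Query expansion: one word-index lookup per term, then a merge."""
    synonym_id = next(s for s, ws in SYNONYM_TABLE.items() if query in ws)
    lookups, merged = 0, {}
    for term in SYNONYM_TABLE[synonym_id]:
        _, postings = WORD_INDEX[term]
        lookups += 1
        for doc, freq in postings.items():
            # Merge step: collapse per-term values into one value per document.
            merged[doc] = merged.get(doc, 0) + freq
    return merged, lookups


def proposed_search(query):
    """One word-index lookup for the ID, one synonym-index lookup, no merge."""
    lookups = 0
    synonym_id, _ = WORD_INDEX[query]
    lookups += 1
    freqs = dict(SYNONYM_INDEX[synonym_id])
    lookups += 1
    return freqs, lookups


conventional = conventional_search("sale")
proposed = proposed_search("sale")
```

Both functions return the same per-document values in this toy data, but the conventional path needs three index lookups plus a merge at query time, while the proposed path needs two lookups because the aggregation was done when the synonym index was built.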

The operations of the components of the information search device in the first to fourth embodiments can be implemented as a program, which can be installed and executed on a computer used as the information search device, or distributed via a network.

The program can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and then installed on a computer or distributed.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible within the scope of the claims.

100 Information search device
200 Search target document storage device
300 Information search terminal
101 Synonym table storage means, synonym table storage unit
102 Word index storage means, word index storage unit
103 Synonym index storage means, synonym index storage unit
110 Index creation means, index function unit
120 Synonym index creation means, synonym index function unit
130 Search word matching degree calculation means, search word matching degree calculation unit
140 Overall ranking calculation unit

Claims

An information search device that, when searching a group of electronic documents using a search word (in word units) specified from a user terminal, calculates a degree of match with the search word by referring to an index in which the electronic documents constituting the search range are stored divided into arbitrary units (hereinafter referred to as "words"), the device comprising:
synonym table storage means storing a synonym table having information (hereinafter referred to as "synonym ID") in which words in a synonym relationship are grouped in advance and an ID is assigned to each group;
index creation means for analyzing the search range and creating a word index in which, for each word, information such as the appearance frequency in each electronic document (hereinafter referred to as "word frequency information") and the information of the synonym ID that includes the word are combined into a record of the word;
synonym index creation means for creating a synonym index in which the word frequency information of each word in the word index of the word index storage means is aggregated for each synonym ID, and storing the synonym index in synonym index storage means; and
search word matching degree calculation means for calculating the degree of match with the search word using the word frequency information obtained by referring to the synonym index storage means with, as a key, the synonym ID obtained by referring to the word index storage means with the search word as a key.

The information search device according to claim 1, wherein the search word matching degree calculation means, when calculating the degree of match with the search word, refers to the word index storage means with the search word as a key to acquire the word frequency information of the word in addition to its synonym ID, and uses that word frequency information for calculating the degree of match together with the word frequency information of the synonym ID obtained by referring to the synonym index storage means.

The information search device, wherein the synonym index creation means selects information using an arbitrarily set condition when aggregating the word frequency information from the information in the word index storage means, and stores the selected information in the synonym index storage means as the synonym index.

An information search method for calculating, when searching a group of electronic documents using a search word (in word units) specified from a user terminal, a degree of match with the search word by referring to an index in which the electronic documents constituting the search range are stored divided into arbitrary units (hereinafter referred to as "words"), the method being performed in a device having:
synonym table storage means storing a synonym table having information (hereinafter referred to as "synonym ID") in which words in a synonym relationship are grouped in advance and an ID is assigned to each group;
word index storage means for storing a word index; and
synonym index storage means for storing a synonym index,
the method comprising:
an index creation step of analyzing the search range, creating a word index in which, for each word, information such as the appearance frequency in each electronic document (hereinafter referred to as "word frequency information") and the information of the synonym ID that includes the word are combined into a record of the word, and storing the word index in the word index storage means;
a synonym index creation step of creating a synonym index in which the word frequency information of each word in the word index of the word index storage means is aggregated for each synonym ID, and storing the synonym index in the synonym index storage means; and
a search word matching degree calculation step of calculating the degree of match with the search word using the word frequency information obtained by referring to the synonym index storage means with, as a key, the synonym ID obtained by referring to the word index storage means with the search word as a key.

The information search method according to claim 4, wherein, in the search word matching degree calculation step, when the degree of match with the search word is calculated, the word index storage means is referred to with the search word as a key to acquire the word frequency information of the word in addition to its synonym ID, and that word frequency information is used for calculating the degree of match together with the word frequency information of the synonym ID obtained by referring to the synonym index storage means.

The information search method, wherein, in the synonym index creation step, information is selected using an arbitrarily set condition when the word frequency information is aggregated from the information in the word index storage means, and the selected information is stored in the synonym index storage means as the synonym index.

An information search program for causing a computer to function as each of the means constituting the information search device according to any one of claims 1 to 3.