JP4759600B2

JP4759600B2 - Text search device, text search method, text search program and recording medium thereof

Info

Publication number: JP4759600B2
Application number: JP2008216556A
Authority: JP
Inventors: 眞哉村田; 浩之戸田; 由美子松浦; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-08-26
Filing date: 2008-08-26
Publication date: 2011-08-31
Anticipated expiration: 2028-08-26
Also published as: JP2010055164A

Description

The present invention relates to a technique for searching the text of sites using an input query and extended words related to that query.

One promising approach to improving the ranking accuracy of a text search system is known as "query expansion". This technique automatically acquires, selects, and appends words that are related in some way to the query (search terms), hereinafter called extended words, in order to derive better search results.

As related literature, Non-Patent Document 1 uses a click log as the source data from which extended words are acquired, and the co-occurrence probability with the query as the selection criterion. The click log is the set of URLs of the sites (clicked documents) that web viewers actually selected from the search results for a query. It is natural to assume that the extended words suited to a query change from moment to moment, and Non-Patent Document 1 addresses this by using a click log.
Non-Patent Document 1: Hang Cui, et al., "Probabilistic Query Expansion Using Query Logs" (2002).
Non-Patent Document 2: S. E. Robertson, "On term selection for query expansion", Journal of Documentation, 46, pages 359-364, 1990.

In Non-Patent Document 1, the URLs of the sites (clicked documents) selected by viewers are determined using the simple click count (absolute click count) obtained by analyzing the click log, and the co-occurrence probability between the query and each keyword contained in the titles and summary sentences (snippets) of those sites is calculated. Keywords are then selected as extended words in descending order of this measure, and query expansion is performed.

However, the absolute click count reflects the tendency for sites ranked higher in the search results to be clicked more often, so a site with a high absolute click count was not necessarily selected by many viewers because they judged it relevant to the query. In addition, in Non-Patent Document 1 the ranking accuracy peaks at 40 to 60 extended words, which incurs a high computational cost.

In view of these problems, the object of the present invention is to realize query expansion that accurately identifies the sites where access is concentrated by analyzing a click log, and that enables a large improvement in search accuracy with a small number of extended words.

The present invention is a technical idea created to solve the above problem. By treating the titles and summary sentences (snippets) of sites where access is concentrated, which many viewers have judged useful, as the source of extended words, it makes it possible to acquire extended words that are highly relevant to the query.

Specifically, the invention according to claim 1 is a text search device that acquires extended words related to an input query and searches sites using the extended words and the query, comprising: click log analysis means for analyzing a click log for the query, in which the sites actually selected by viewers are ranked based on click count, to identify the sites where access is concentrated; and extended word acquisition means for analyzing the titles and summary sentences of the sites where access is concentrated to acquire the extended words. The click log analysis means has: first analysis means for calculating the difference in click count between sites at adjacent ranks in the click log as a relative click count and obtaining the degree of access concentration of each site according to the relative click count; second analysis means for referring to a database in which the click probability for the site at each rank is stored for each search, calculating as an occurrence probability the actual click count relative to the average click probability of the rank, and obtaining candidates for access concentration sites using the occurrence probability and a threshold; and analysis result integration means for integrating the analysis results of the two analysis means to identify the sites where access is concentrated.

The invention according to claim 2 is characterized in that the extended word acquisition means has: means for analyzing the titles and summary sentences of the sites where access is concentrated to obtain a group of extended word candidates; means for ordering the extended word candidates; and means for selecting extended words based on the rank of each ordered candidate.

The invention according to claim 3 is characterized by further comprising: search execution means for searching sites using the query and the extended words and outputting the search results; and click log feedback processing means for reflecting the user's click information on the search results of the search execution means in the click log.

The invention according to claim 4 is a text search method that acquires extended words related to an input query and searches sites using the extended words and the query, comprising: a first step in which click log analysis means analyzes a click log for the query, in which the sites actually selected by viewers are ranked based on click count, to identify the sites where access is concentrated; and a second step in which extended word acquisition means analyzes the titles and summary sentences of the sites where access is concentrated to acquire the extended words. The first step has: a step of calculating the difference in click count between sites at adjacent ranks in the click log as a relative click count and obtaining the degree of access concentration of each site according to the relative click count; a step of referring to a database in which the click probability for the site at each rank is stored for each search, calculating as an occurrence probability the actual click count relative to the average click probability of the rank, and obtaining candidates for access concentration sites using the occurrence probability and a threshold; and a step of integrating the analysis results of the two preceding steps to identify the sites where access is concentrated.

The invention according to claim 5 is characterized in that the second step has: a step of analyzing the titles and summary sentences of the sites where access is concentrated to obtain a group of extended word candidates; a step of ordering the extended word candidates; and a step of selecting extended words based on the rank of each ordered candidate.

The invention according to claim 6 is characterized by further comprising: a step in which search execution means searches sites using the query and the extended words and outputs the search results; and a step in which click log feedback processing means reflects the user's click information on the search results of the preceding step in the click log.

The invention according to claim 7 is a text search program that causes a computer to execute each step of the text search method according to any one of claims 4 to 6.

The invention according to claim 8 is a computer-readable recording medium on which the text search program according to claim 7 is recorded.

According to the inventions of claims 1 to 8, extended words are acquired from the titles and summary sentences (snippets) of the sites where access is concentrated, so extended words that are highly relevant to the query can be acquired. This reduces the number of extended words needed, so highly accurate search results can be obtained while keeping the computational cost down.

In addition, because the analysis results of two independent measures are combined, the sites where access is concentrated can be identified accurately.

Furthermore, according to the inventions of claims 3 and 6, the user's judgments (clicks) on the search results can be reflected in the click log at any time.

FIG. 1 shows a text search device 1 according to an embodiment of the present invention. An example in which the text search device 1 is implemented on a computer is described here, but the text search device is not limited to this. For example, it may be a computing machine equipped with an IC (Integrated Circuit) chip that implements the text search processing logic, or a mobile terminal such as a mobile phone.

As shown in FIG. 1, the text search device 1 has four main functional blocks: a display unit 100 that displays a query input screen 101 and a result display screen 102; a search expression generation unit 110 that generates a search expression using the input query and the extended words related to it; an extended word selection unit 120 that acquires extended words for the input query; and a search execution unit 140 that executes the search expression.

The functions of the functional blocks 100, 110, 120, and 140 are realized by the control unit (CPU: Central Processing Unit) of the text search device 1 loading a text search program. The text search device 1 also has the usual components of a computer: an input unit such as a keyboard and mouse (not shown), a rewritable memory (RAM) that temporarily stores processing data, a communication device used for network connection, a storage unit such as a hard disk drive, and a display unit such as a monitor. The functional blocks 100, 110, 120, and 140 are described in detail below with reference to FIG. 1.

<Display unit 100>
The display unit 100 displays, via a browser, the query input screen 101 on which the user enters a query (search terms) and the result display screen 102 that shows the search results obtained from the search execution unit 140. The user enters the query on the query input screen 101 using the keyboard or a similar input device.

<Search expression generation unit 110>
The search expression generation unit 110 receives the query entered on the query input screen 101 and sends the query, together with a request for extended words for that query, to the extended word selection unit 120. When it receives extended words from the extended word selection unit 120, it uses them to generate a search expression that reorders the search results for the initial query, and sends this expression to the search execution unit 140. With this search expression, the ranking of the first search results for the input query is rearranged into a more accurate ranking using the extended words.

<Extended word selection unit 120>
The extended word selection unit 120 has a command unit 121, an analysis result integration unit 122, an analysis unit A 123, an analysis unit B 124, an analysis unit C 125, an information extraction unit 128, a collation unit 129, a search result acquisition unit 130, a function word extraction / noun phrase generation unit 133, an extended word weighting / ordering unit 134, and a click log feedback processing unit 150. It also has three DBs (databases), namely a click log DB 126, a click probability DB 127, and an index DB 132, and a search engine 131. The DBs 126, 127, and 132 are built on the hard disk drive.

When the extended word selection unit 120 receives the query and the request for extended words from the search expression generation unit 110, it sends the query to the command unit 121.

When the command unit 121 receives the query from the extended word selection unit 120, it sends the query and a click log analysis request to the analysis result integration unit 122.

When the analysis result integration unit 122 receives the query and the click log analysis request from the command unit 121, it forwards them to the analysis unit A 123 and the analysis unit B 124.

When the analysis unit A 123 and the analysis unit B 124 receive the query and the click log analysis request from the analysis result integration unit 122, each reads the click log corresponding to the query from the click log DB 126. In the click log DB 126, the URLs of the sites (clicked documents) that viewers actually selected from the search results for a query are ranked by click count and stored as the click log. The click log DB 126 is built from click log data acquired in advance over the Internet from a server (not shown).

The analysis unit A 123 analyzes the click log it has read with a focus on the degree of access concentration of each site, and returns the analysis result to the analysis result integration unit 122.

The analysis unit B 124 additionally reads the per-rank site click probabilities stored in the click probability DB 127, analyzes the click log and the click probabilities read from the two DBs 126 and 127, and returns the analysis result to the analysis result integration unit 122. Like the click log DB 126, the click probability DB 127 is built from the click log data on the server.

When the analysis result integration unit 122 receives the analysis results from the analysis unit A 123 and the analysis unit B 124, it integrates them to identify the access concentration sites (Access Concentration Sites, hereinafter abbreviated as ACS) and returns their URLs to the command unit 121. The access concentration sites (ACS) are described in detail later.

When the command unit 121 receives the URLs of the access concentration sites (ACS) from the analysis result integration unit 122, it sends these URLs to the information extraction unit 128.

When the information extraction unit 128 receives the query and the ACS URLs from the command unit 121, it sends them to the collation unit 129.

When the collation unit 129 receives the query and the ACS URLs from the information extraction unit 128, it sends the query to the search result acquisition unit 130.

When the search result acquisition unit 130 receives the query from the collation unit 129, it submits the query to the search engine 131.

When the search engine 131 receives a request for search results from the search result acquisition unit 130 or the search execution unit 140, it searches the index DB 132 and returns the results.

The index DB 132 stores sites of the "World Wide Web" or the "Mobile Web" in indexed form. The index DB 132 is built from data acquired in advance over the Internet from a server (not shown). Here, the search engine 131 is assumed to include the index DB 132.

When the search result acquisition unit 130 receives the search results for the query from the search engine 131, it returns them to the collation unit 129.

When the collation unit 129 receives the search results for the query from the search result acquisition unit 130, it compares the URLs in the search results with the ACS URLs received from the information extraction unit 128 and identifies the access concentration sites (ACS) among the search results. It then returns the information on the identified ACS (URL, title, and so on) to the information extraction unit 128.
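
The collation step described above can be sketched as a simple filter over the search results. This is only an illustration: the `SearchResult` record and the function name `collate_acs` are hypothetical and do not appear in the patent, and exact string equality between URLs is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    # Hypothetical record for one entry of the search results.
    url: str
    title: str
    snippet: str


def collate_acs(results, acs_urls):
    """Return the search results whose URL is one of the ACS URLs.

    Assumes URLs can be compared by exact string equality; the order of
    the search results is preserved.
    """
    wanted = set(acs_urls)
    return [r for r in results if r.url in wanted]
```

The titles and snippets of the returned entries are what the information extraction unit 128 would pass on to the next stage.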

When the information extraction unit 128 receives the information on the access concentration sites (ACS) (URL, title, and so on) from the collation unit 129, it extracts the titles and snippets (Titles and Snippets) from this information and sends them to the function word extraction / noun phrase generation unit 133.

When the function word extraction / noun phrase generation unit 133 receives the ACS titles and snippets from the information extraction unit 128, it performs morphological analysis on them and extracts function words or generates noun phrases. These become the group of extended word candidates used in the subsequent query expansion and are sent to the extended word weighting / ordering unit 134.

When the extended word weighting / ordering unit 134 receives the group of extended word candidates from the function word extraction / noun phrase generation unit 133, it weights and orders the candidates based on the "Robertson Selection Value (RSV)" of Non-Patent Document 2 and sends the result to the extended word selection unit 120.
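
The text does not give the exact RSV formula it uses. As a rough sketch, one widely used form of Robertson's selection value multiplies the number of relevant documents containing a term by the Robertson/Sparck Jones relevance weight. The mapping below (the ACS play the role of the relevant set) and all function and parameter names are assumptions made for illustration.

```python
import math


def rsj_weight(r, n, R, N):
    """Robertson/Sparck Jones relevance weight with 0.5 smoothing.

    r: relevant documents containing the term (assumed: ACS with the term)
    n: documents in the collection containing the term
    R: number of relevant documents (assumed: number of ACS)
    N: number of documents in the collection
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5))
                    / ((n - r + 0.5) * (R - r + 0.5)))


def rsv(r, n, R, N):
    """Selection value in the common r * w form (an assumed variant)."""
    return r * rsj_weight(r, n, R, N)


def rank_candidates(stats, R, N):
    """Order candidate terms by descending RSV.

    stats maps each candidate term to its (r, n) pair.
    """
    return sorted(stats, key=lambda t: rsv(stats[t][0], stats[t][1], R, N),
                  reverse=True)
```

A term that appears in most ACS but rarely elsewhere in the collection receives a high value, which fits the goal of reaching good accuracy with few extended words.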

When the extended word selection unit 120 receives the ordered group of extended word candidates from the extended word weighting / ordering unit 134, it selects from this group the extended words that will actually be used and returns them to the search expression generation unit 110.

The analysis unit C 125 reads all click logs from the click log DB 126, calculates the click probability for the site at each rank, and updates the click probability DB 127 with the result. This update is executed at preset fixed intervals.
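
The text does not spell out how the per-rank click probability is computed. A minimal sketch, assuming it is the total number of clicks at a rank divided by the total number of times that rank was viewed, summed over all queries, could look like this. The tuple layout of the log records and the function name are hypothetical.

```python
from collections import defaultdict


def click_probability_by_rank(logs):
    """Estimate P(r) for every rank r from all click log records.

    logs: iterable of (query, rank, clicks, views) tuples, where `views`
    is the number of times the site at that rank was seen (assumed to be
    available in the log).
    """
    clicks = defaultdict(int)
    views = defaultdict(int)
    for _query, rank, c, n in logs:
        clicks[rank] += c
        views[rank] += n
    # Ranks that were never viewed are skipped to avoid division by zero.
    return {rank: clicks[rank] / views[rank]
            for rank in views if views[rank] > 0}
```

Running such a function periodically and writing its output to the click probability DB 127 would correspond to the update described above.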

The click log feedback processing unit 150 records information on the sites that the user clicked in the search results shown on the result display screen 102 (click information), generates a new click log from it, and feeds this back into the click log DB 126 as needed.

<Search execution unit 140>
When the search execution unit 140 receives a search expression from the search expression generation unit 110, it submits the expression to the search engine 131 and receives the corresponding search results. It then displays these results on the result display screen 102.

<Operation example>
The text search device 1 analyzes the click log based on the query entered by the user to acquire extended words, and obtains more accurate search results by expanding the query with these extended words. This series of processes consists of four main phases: a query input phase, a click log analysis phase, an extended word acquisition phase, and a search execution phase. The processing in each phase is described in detail below with reference to FIGS. 2 to 7.

(1) Query input phase
FIG. 2 shows the processing flow of the query input phase. First, the user enters a query on the query input screen 101. The entered query is sent to the search expression generation unit 110. The search expression generation unit 110 sends the query, together with a request to extract and select extended words, to the extended word selection unit 120.

(2) Click log analysis phase
In the click log analysis phase, the click log is analyzed using the input query. The aim is to identify, among the sites in the click log, those where access is concentrated. The reasoning is that a viewer deciding which site to select (click) in the search results is likely to judge by the title and snippet (summary sentence) of each site, so the titles and snippets of sites where access is concentrated can be expected to contain keywords that viewers found useful. Expanding the query with these keywords can therefore be expected to improve search accuracy substantially. Such a site is referred to here as an access concentration site (ACS), and its title and snippet (Titles and Snippets) are referred to as TS. The click log analysis phase identifies the ACS by analyzing the click log using the input query.

FIG. 3 shows the processing flow of the click log analysis phase. When the command unit 121 receives the query from the extended word selection unit 120 (continuing from symbol A in FIG. 2), it sends the query and a request to analyze the corresponding click log to the analysis result integration unit 122. On receiving this request, the analysis result integration unit 122 sends the query to the analysis unit A 123 and the analysis unit B 124, and the analysis phase begins.

The analysis unit A 123 calculates the access concentration degree (ACD), which serves as an index for identifying the access concentration sites (ACS). Specifically, the analysis unit A 123 reads the click log for the query from the click log DB 126. It then looks at the click count of the site at a given rank in the click log and the click counts of the sites at the two adjacent ranks, and calculates the relative click counts using equations (1) and (2). Equations (1) and (2) are assumed to be defined in the program of the text search device 1.

Here, c(q, r) is the click count of the site at rank r in the click log for query q, and c(q, r−1) and c(q, r+1) are the click counts of the sites at rank r−1, immediately to the left of rank r, and rank r+1, immediately to the right.

slope_L and slope_R correspond to the slopes toward rank r−1 and rank r+1, respectively, of the curve obtained by analyzing the click log for a query q and plotting click count against rank. FIG. 4 shows an example of such a curve.

In FIG. 4, the horizontal axis is the rank of the site and the vertical axis is the click count. The specific query curve is the curve obtained by analyzing the click log for one particular query. The average query curve is obtained by analyzing the click log for all queries and averaging over the number of queries.

A site at a rank where the specific query curve becomes steep, that is, where the curve forms a strong peak, is assumed to be a site where access is concentrated and is regarded as a candidate access concentration site (ACS). The steepness of the curve is defined by equation (3) as the access concentration degree ACD(q, r). Equation (3) is assumed to be defined in the program of the text search device 1.

Here, θ_L(r) and θ_R(r) are the angles corresponding to the slopes slope_L and slope_R, and the access concentration degree ACD(q, r) is characterized by these angles. The analysis unit A 123 calculates ACD(q, r) for each site and returns it to the analysis result integration unit 122 as its analysis result.
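
Equations (1) to (3) are not reproduced in this text, so the following is only a sketch of how a peak measure of this kind could be computed. It assumes that slope_L and slope_R are the click-count differences to the neighboring ranks over a rank step of 1, that the angles are the arctangents of those slopes, and that ACD is the sum of the two angles. All three choices, and the function name, are assumptions rather than the patent's definitions.

```python
import math


def access_concentration_degree(clicks, r):
    """Assumed peak measure for the site at 1-based rank r.

    clicks: list of click counts for one query, clicks[0] being rank 1.
    A missing neighbor (first or last rank) contributes a slope of 0.
    """
    c = clicks[r - 1]
    # Assumed form of equations (1) and (2): difference to each neighbor.
    slope_l = c - clicks[r - 2] if r >= 2 else 0
    slope_r = c - clicks[r] if r < len(clicks) else 0
    # Assumed form of equation (3): sum of the two slope angles.
    theta_l = math.atan(slope_l)
    theta_r = math.atan(slope_r)
    return theta_l + theta_r
```

Under these assumptions a rank whose click count rises sharply above both neighbors gets a value close to π, while a rank in a dip gets a negative value.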

As a second index for identifying the access concentration sites (ACS), the analysis unit B 124 determines statistically whether the click count of a site in the click log is clearly high. That is, if the click count of the site at a given rank r greatly exceeds the click count expected for that rank (the average click count), the site can be said to have been selected by viewers intentionally rather than by chance.

Specifically, when the analysis unit B 124 receives the query q and the click log analysis request from the analysis result integration unit 122, it reads the click log for the query q from the click log DB 126. At the same time, it refers to the click probability DB 127 and reads the click probability P(r) of the site at rank r. Assuming that the click count of the site at a given rank r follows a binomial distribution, it evaluates whether the actual click count c(q, r) greatly exceeds the click count expected for that rank (the average click count) by calculating the occurrence probability p(q, r) with equation (4). Equation (4) is assumed to be defined in the program of the text search device 1.

Here, n(q, r) is the total number of times that viewers saw the site at rank r, and each such viewing corresponds to one "trial". This total n(q, r) is the sum of the number of clicks c(q, r) on the site at rank r and the number of times nc(q, r) that the site was passed over. The number of times passed over is the number of times a site ranked lower than r was clicked. If the same viewer clicks the site at rank r and then a lower-ranked site in succession, the viewer is identified by a viewer ID or the like and the clicks are counted as one trial.
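
The counting rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: the event format `(viewer_id, clicked_rank)`, the function name `count_trials`, and the choice to treat each viewer as at most one trial per rank are assumptions made for this example.

```python
from collections import defaultdict


def count_trials(click_events, rank):
    """Return (c, nc, n) for one query and one rank.

    click_events: iterable of (viewer_id, clicked_rank) pairs (hypothetical
    format); rank 1 is the top result, larger numbers are lower ranks.
    """
    # Group clicks by viewer so that each viewer contributes one trial at most.
    ranks_by_viewer = defaultdict(set)
    for viewer_id, clicked_rank in click_events:
        ranks_by_viewer[viewer_id].add(clicked_rank)

    c = nc = 0
    for ranks in ranks_by_viewer.values():
        if rank in ranks:
            c += 1   # the viewer clicked the site at this rank
        elif any(r > rank for r in ranks):
            nc += 1  # the viewer passed over this rank and clicked a lower one
    return c, nc, c + nc


events = [("u1", 1), ("u1", 3), ("u2", 3), ("u3", 2), ("u4", 1)]
print(count_trials(events, 2))
```

A viewer who clicked only higher-ranked sites is not counted for rank r in this sketch, because nothing in the log shows that the viewer reached that rank.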

Here, whether or not the site at rank r is clicked is regarded as a single trial of the binomial distribution, and when the occurrence probability p(q, r) of the site's observed number of clicks is extremely small, that number of clicks is regarded as clearly large.

For example, in the graph shown in FIG. 5, the horizontal axis is the number of clicks and the vertical axis is its occurrence probability. The graph shows the binomial distribution of the number of clicks when the click probability of the site at rank r is P(r) = 32% and the number of trials is n(q, r) = 100. In other words, when the click probability of the site at rank r is 32% over 100 trials, the expected number of clicks is assumed to follow a binomial distribution like the one in FIG. 5.

If the number of clicks c(q, r) actually obtained by the site at rank r falls within the rightmost 2.5% region of the graph (hereinafter, region S), the site is considered to have been clicked clearly more often than expected and is regarded as a candidate access concentration site (ACS). The analysis unit B124 returns the group of ACS candidates obtained in this way to the analysis result integration unit 122 as its analysis result.
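
The region-S test can be illustrated with a short sketch. Equation (4) is not reproduced in this text, so the code below is an assumption: it models the click count as Binomial(n(q, r), P(r)) and treats a site as a candidate when the upper-tail probability of its observed click count is at most 2.5%. The function names and the `alpha` parameter are hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k clicks in n trials with click probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(c, n, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(binomial_pmf(k, n, p) for k in range(c, n + 1))


def is_acs_candidate(c, n, p, alpha=0.025):
    """True if the observed click count c lies in the rightmost alpha region."""
    return upper_tail(c, n, p) <= alpha


# Setting of FIG. 5: P(r) = 32%, n(q, r) = 100 (expected click count: 32).
for clicks in (35, 50):
    print(clicks, is_acs_candidate(clicks, 100, 0.32))
```

With these settings, 35 clicks is close to the expected 32 and is not flagged, while 50 clicks lies far in the right tail and is flagged.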

The rightmost 2.5% used as the threshold of region S in FIG. 5 is a value generally regarded as marking a statistically significant difference, but it can be changed as appropriate to suit the configuration.

On receiving the analysis results from the analysis unit A123 and the analysis unit B124, the analysis result integration unit 122 identifies the access concentration sites (ACS) based on them. That is, among the sites whose click count falls within the rightmost 2.5% of the binomial distribution, the top K sites with the highest access concentration degree ACD(q, r) are regarded as access concentration sites (ACS), and their URLs are acquired. These URLs are then returned to the command unit 121.

If no site has a click count that falls within region S, the sites are ordered by access concentration degree ACD(q, r) in descending order, and the top K sites are regarded as access concentration sites (ACS). The value of K can be changed as appropriate to suit the configuration.
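
The integration rule, including the fallback when no site falls within region S, can be sketched as below. The dictionary keys `url`, `acd`, and `in_region_s` and the function name `select_acs` are hypothetical names chosen for this illustration.

```python
def select_acs(sites, k):
    """Return the URLs of the top-k access concentration sites.

    sites: list of dicts with the hypothetical keys
      'url'         - site URL
      'acd'         - access concentration degree from analysis unit A
      'in_region_s' - True if analysis unit B flagged the site as a candidate
    """
    candidates = [s for s in sites if s["in_region_s"]]
    # Fall back to all sites when no site is in region S.
    pool = candidates if candidates else sites
    ranked = sorted(pool, key=lambda s: s["acd"], reverse=True)
    return [s["url"] for s in ranked[:k]]


sites = [
    {"url": "http://a.example", "acd": 0.9, "in_region_s": False},
    {"url": "http://b.example", "acd": 0.7, "in_region_s": True},
    {"url": "http://c.example", "acd": 0.4, "in_region_s": True},
]
print(select_acs(sites, 2))
```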

The analysis unit C125 reads all click logs from the click log DB 126 and calculates the click probability P(r) for the site at each rank. It then updates the click probability DB 127 with the result. This update is performed at preset regular intervals.
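
The text does not state how P(r) is derived from the full click log. One plausible estimator, shown here purely as an assumption, pools the per-query click counts c(q, r) and trial counts n(q, r) over all queries and takes their ratio for each rank. The input layout and the function name `click_probabilities` are hypothetical.

```python
from collections import defaultdict


def click_probabilities(counts_by_query):
    """Estimate P(r) for each rank from pooled counts.

    counts_by_query: {query: {rank: (c, n)}} where c is the click count and
    n the trial count for that query and rank (hypothetical layout).
    """
    clicks = defaultdict(int)
    trials = defaultdict(int)
    for per_rank in counts_by_query.values():
        for rank, (c, n) in per_rank.items():
            clicks[rank] += c
            trials[rank] += n
    # Ranks with no trials are omitted rather than assigned a probability.
    return {r: clicks[r] / trials[r] for r in trials if trials[r] > 0}


log_counts = {
    "query a": {1: (30, 100), 2: (10, 60)},
    "query b": {1: (2, 100)},
}
print(click_probabilities(log_counts))
```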

(3) Extended word acquisition phase
In the extended word acquisition phase, extended words for query expansion are acquired from the titles and summary sentences (snippets) of the access concentration sites (ACS) identified in the click log analysis phase. The titles and snippets of sites where access is concentrated contain keywords that viewers judged to be useful, so expanding the query with these keywords can be expected to improve search accuracy substantially.

FIG. 6 shows the processing flow of the extended word acquisition phase. The phase starts after the information extraction unit 128 receives the URLs of the access concentration sites (ACS) and the query from the command unit 121 (continuing from symbol B in FIG. 3).

The information extraction unit 128 transmits the received ACS URLs and the query to the collation unit 129. The collation unit 129 transmits the query to the search result acquisition unit 130, which in turn submits the query to the search engine 131.

The search engine 131 uses the query to search the index DB 132, obtains the URLs and the number of search results N, and returns them to the search result acquisition unit 130. On receiving them, the search result acquisition unit 130 returns the search results (the URLs and the number of search results N) to the collation unit 129.

The collation unit 129 compares the URLs in the search results received from the search result acquisition unit 130 with the ACS URLs received from the information extraction unit 128 to identify the access concentration sites (ACS) among the results. It then returns the information on the identified sites (URL, title, snippet, and so on) together with the number of search results N to the information extraction unit 128.

The information extraction unit 128 extracts the titles and snippets from the received ACS information and transmits them, together with the number of search results N, to the function word extraction / noun phrase generation unit 133.

The function word extraction / noun phrase generation unit 133 decomposes the received titles and snippets into morphemes, extracts function words, and generates noun phrases. It then transmits these function words and noun phrases, as the candidate group of extended words for the subsequent query expansion, to the extended word weighting / ordering unit 134 together with the number of search results N.

The extended word weighting / ordering unit 134 weights and orders the extended word candidates using Equations (5) and (6), the "Robertson Selection Value (RSV)" of Non-Patent Document 2. The inputs are each received candidate i, the number an(i) of access concentration sites (ACS) whose title and snippet contain candidate i, the number of search results N, and the total number K of access concentration sites (ACS). Equations (5) and (6) are assumed to be defined in the program of the text search device 1.

Here, n(i) is the number of sites, among the N search results, whose title and snippet contain the extended word candidate i. The extended word weighting / ordering unit 134 transmits the candidate group ordered in this way to the extended word selection unit 120.

The extended word selection unit 120 adopts the top T of the received candidates as the extended words actually used for query expansion, and returns the adopted extended word group to the search expression generation unit 110. The value of T can be changed as appropriate to suit the configuration.
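
The weighting and top-T selection can be sketched as follows. Equations (5) and (6) are not reproduced in this text, so the formulas below are an assumption: they use one commonly cited form of the Robertson selection value, namely the Robertson/Sparck Jones relevance weight multiplied by the difference between the candidate's rate of occurrence in the ACS set (an(i)/K) and in the whole result set (n(i)/N). The function names and the example counts are hypothetical.

```python
from math import log


def rsj_weight(an, n, K, N):
    """Relevance weight with 0.5 smoothing (assumed form, not Equation (5))."""
    return log(
        ((an + 0.5) * (N - n - K + an + 0.5))
        / ((n - an + 0.5) * (K - an + 0.5))
    )


def rsv(an, n, K, N):
    """Selection value: weight times (rate in ACS set - rate in result set)."""
    return rsj_weight(an, n, K, N) * (an / K - n / N)


def select_extended_words(candidates, K, N, T):
    """candidates: {word: (an, n)}; returns the top-T words by rsv."""
    scored = {w: rsv(an, n, K, N) for w, (an, n) in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)[:T]


# Hypothetical counts: K = 5 ACS, N = 1000 search results.
candidates = {"recipe": (4, 20), "the": (1, 500), "pasta": (3, 40)}
print(select_extended_words(candidates, K=5, N=1000, T=2))
```

A candidate that appears in most ACS titles and snippets but in few of the overall results scores highest, which matches the intent of selecting words that are specific to the sites viewers actually chose.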

(4) Search execution phase
In the search execution phase, a search expression is generated using the extended words acquired in the extended word acquisition phase, a search is executed with this expression, and the user's judgments (clicks) on the search results are reflected in the click log.

FIG. 7 shows the processing flow of the search execution phase. On receiving the extended word group from the extended word selection unit 120 (continuing from symbol C in FIG. 6), the search expression generation unit 110 uses it to generate a search expression that performs query expansion, and transmits the expression to the search execution unit 140.

The search execution unit 140 submits the received search expression to the search engine 131. Using this expression, the search engine 131 searches both the titles and the bodies (main text) of the sites stored in the index DB 132.

The search expression defines a two-step method: first, the set of search results (a ranked group of sites) for the initial query entered by the user is determined; next, these sites are scored using the extended words and reordered by score into a more accurate ranking.

Specifically, for the text of each site in the search results for the initial query, a score is computed as "extended word weight rsv(i) × tf·idf of the extended word in that text", and the sites are reordered by this score. Here, "tf·idf" is a word weight derived from measures such as how often a word appears in a document. The reordered result is then displayed on the result display screen 102. The user can thus see, as the search result for the originally entered query, a ranking ordered with higher accuracy.
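
The rescoring step can be sketched as below. The text does not specify the tf·idf variant or how scores for several extended words are combined, so this example assumes raw term frequency times log(N/df) over the initial result set and sums the contributions of all extended words. The token-list representation and the function names are hypothetical.

```python
from math import log


def tf_idf(term, doc_tokens, all_docs):
    """Raw term frequency times log(N / document frequency) (assumed variant)."""
    tf = doc_tokens.count(term)
    df = sum(1 for d in all_docs if term in d)
    return tf * log(len(all_docs) / df) if df else 0.0


def rerank(results, rsv_weights):
    """Reorder the initial results by the sum of rsv(i) * tf-idf.

    results: list of (url, tokens) in initial ranking order.
    rsv_weights: {extended_word: rsv value}.
    """
    docs = [tokens for _, tokens in results]

    def score(tokens):
        return sum(w * tf_idf(t, tokens, docs) for t, w in rsv_weights.items())

    # sorted() is stable, so ties keep their initial ranking order.
    ordered = sorted(results, key=lambda r: score(r[1]), reverse=True)
    return [url for url, _ in ordered]


results = [
    ("http://a.example", ["pasta", "shop"]),
    ("http://b.example", ["pasta", "recipe", "recipe"]),
    ("http://c.example", ["news"]),
]
print(rerank(results, {"recipe": 4.0}))
```

Only the sites returned for the initial query are rescored, so query expansion changes the order of the results but not which sites appear.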

The user then selects sites from this search result by clicking. This click information (information on the sites the user clicked) is transmitted to the click log feedback processing unit 150, which generates a new click log from it and uses this log to update the click log DB 126 as needed.

As described above, the text search device 1 according to this embodiment identifies access concentration sites (ACS) by combining two independent measures, the access concentration degree ACD and the occurrence probability of the number of clicks, which makes it possible to determine access concentration sites (ACS) accurately.

In addition, because query expansion uses keywords extracted from the titles and snippets of the access concentration sites (ACS), a substantial improvement in search accuracy can be achieved with a small number of extended words (1 to 5 words).

Furthermore, because the click log DB used to acquire extended words is updated as needed through automatic processing of user feedback, extended words that track the constantly changing context of a query can be extracted appropriately.

<Other examples>
The click log DB 126 and the index DB 132 do not necessarily have to be implemented in the text search device 1. For example, they may be implemented in a server (not shown) connected to the text search device 1 via a network. In this case, the connection to both DBs 126 and 132 is made via the communication device.

That is, in the click log analysis phase, the analysis unit A123 and the analysis unit B124 connect to the click log DB 126 via the communication device and acquire the click log corresponding to the query. The acquired click log data is temporarily stored in the memory (RAM), and both analysis units 123 and 124 analyze this data by the method described above.

Similarly, the analysis unit C125 connects to the click log DB 126 via the communication device and acquires all click logs. It then calculates the click probabilities from the acquired click logs and updates the click probability DB 127.

In the extended word acquisition phase and the search execution phase, the search engine 131 connects to the index DB 132 via the communication device and performs the search. Similarly, the click log feedback processing unit 150 connects to the click log DB 126 via the communication device and updates the click log DB 126 as needed.

The present invention can also be provided as a text search program that causes a computer to function as the functional blocks 100, 110, 120, and 140 of the text search device 1. This program may cause the computer to execute all of the processing of the functional blocks 100, 110, 120, and 140, or only part of it.

This program is provided to a computer by download from a website or the like. The program may also be provided to a computer stored on a recording medium such as a CD-ROM, DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, MO, HDD, or Blu-ray Disc (registered trademark). Because the program code read from the recording medium causes the computer to function as the functional blocks of this embodiment, the recording medium also constitutes the present invention.

FIG. 1 is a configuration diagram of the text search device according to an embodiment of the present invention.
FIG. 2 is a processing flow diagram of the query input phase of the same embodiment.
FIG. 3 is a processing flow diagram of the click log analysis phase of the same embodiment.
FIG. 4 is a graph showing the access concentration degree ACD of the same embodiment.
FIG. 5 is a graph showing the binomial distribution of the number of clicks of the same embodiment.
FIG. 6 is a processing flow diagram of the extended word acquisition phase of the same embodiment.
FIG. 7 is a processing flow diagram of the search execution phase of the same embodiment.

Explanation of symbols

1 … Text search device
100 … Display unit
101 … Query input screen
102 … Result display screen
110 … Search expression generation unit
120 … Extended word selection unit
121 … Command unit
122 … Analysis result integration unit
123 … Analysis unit A (first analysis means)
124 … Analysis unit B (second analysis means)
125 … Analysis unit C
126 … Click log DB
127 … Click probability DB
128 … Information extraction unit
129 … Collation unit
130 … Search result acquisition unit
131 … Search engine
132 … Index DB
133 … Function word extraction / noun phrase generation unit
134 … Extended word weighting / ordering unit
140 … Search execution unit
150 … Click log feedback processing unit

Claims

A text search device that acquires an extended word related to an input query and searches sites using the extended word and the query, the device comprising:
click log analysis means for analyzing a click log for the query, in which sites actually selected by viewers are ranked based on the number of clicks, and identifying a site where access is concentrated; and
extended word acquisition means for analyzing the title and summary sentence of the site where access is concentrated to acquire the extended word,
wherein the click log analysis means comprises:
first analysis means for calculating, as a relative number of clicks, the difference in the number of clicks between sites at adjacent ranks in the click log, and calculating a degree of access concentration of each site according to the relative number of clicks;
second analysis means for referring to a database in which the click probability for each rank of sites is stored for each search, calculating, as an occurrence probability, the probability of the actual number of clicks relative to the average click probability of the rank, and obtaining candidates for the site where access is concentrated using the occurrence probability and a threshold; and
analysis result integration means for integrating the analysis results of the two analysis means and identifying the site where access is concentrated.

The text search device according to claim 1, wherein the extended word acquisition means comprises:
means for analyzing the title and summary sentence of the site where access is concentrated to obtain a candidate group of extended words;
means for ordering the candidates for the extended word; and
means for selecting an extended word based on the ranking of the ordered candidates.

The text search device according to claim 1, further comprising:
search execution means for searching sites using the query and the extended word and outputting a search result; and
click log feedback processing means for reflecting, in the click log, user click information on the search result of the search execution means.

A text search method for acquiring an extended word related to an input query and searching sites using the extended word and the query, the method comprising:
a first step in which click log analysis means analyzes a click log for the query, in which sites actually selected by viewers are ranked based on the number of clicks, and identifies a site where access is concentrated; and
a second step in which extended word acquisition means analyzes the title and summary sentence of the site where access is concentrated to acquire the extended word,
wherein the first step includes:
a step of calculating, as a relative number of clicks, the difference in the number of clicks between sites at adjacent ranks in the click log, and determining a degree of access concentration of each site according to the relative number of clicks;
a step of referring to a database in which the click probability for each rank of sites is stored for each search, calculating, as an occurrence probability, the probability of the actual number of clicks relative to the average click probability of the rank, and obtaining candidates for the site where access is concentrated using the occurrence probability and a threshold; and
a step of integrating the analysis results of the two preceding steps to identify the site where access is concentrated.

The text search method according to claim 4, wherein the second step includes:
a step of analyzing the title and summary sentence of the site where access is concentrated to obtain a candidate group of extended words;
a step of ordering the candidates for the extended word; and
a step of selecting an extended word based on the ranking of the ordered candidates.

The text search method according to claim 4, further comprising:
a step in which search execution means searches sites using the query and the extended word and outputs a search result; and
a step in which click log feedback processing means reflects, in the click log, user click information on the search result of that step.

A text search program that causes a computer to execute each step of the text search method according to claim 4.

A computer-readable recording medium on which the text search program according to claim 7 is recorded.