JP2012164242A

JP2012164242A - Related word extraction device, related word extraction method, related word extraction program

Info

Publication number: JP2012164242A
Application number: JP2011025579A
Authority: JP
Inventors: Naoki Fujita; 尚樹藤田; Norifumi Katabuchi; 典史片渕; Ryoji Kataoka; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-02-09
Filing date: 2011-02-09
Publication date: 2012-08-30
Anticipated expiration: 2031-02-09
Also published as: JP5547669B2

Abstract

Kind Code: A1
[Problem] To extract diverse related words by analyzing less data and to present them to the user.
[Solution] Burst detection means 4 analyzes the query log in search log DB 3 in arbitrary date-and-time units and detects search terms whose search count has surged. The detection results are stored in burst information DB 5. Related word extraction means 6 refers to the data stored in DB 5, obtains the search terms for each detection date, and extracts a group of related words for each search term from the query log. The extracted related word groups are stored in burst related word information DB 7. Related word output means 8 receives a search term entered by the user as a request from search engine 2, obtains the related words for the received search term from DB 7, and returns them to search engine 2.
[Selected drawing] FIG. 1

Description

The present invention relates to the technical field of search engines, and in particular to a technique for extracting words related to a search term (keyword) from a search log and recommending the extracted related words to the user as additional search terms.

As is well known, a search engine responds with a list of documents on the Internet according to the character string (query) of a search term entered by the user, and users rely on search engines to obtain the information they need from the vast amount of information on the Internet. The search terms that users enter are often nouns such as the "people, things, and events" that they want to know about.

However, a search term may have several meanings, or the user may have a search need to learn something specific about the search term. As an example of the former, the search term "Yakult (registered trademark)" refers not only to "Yakult" the beverage but also to a company and a baseball team. As an example of the latter, for the search term "Kyoto" there are users who want to know about "autumn leaves" in "Kyoto" and users who want to know about "cherry-blossom viewing", "souvenirs", and so on.

For this reason, search engines such as "goo (registered trademark)" and "Google (registered trademark)" provide a guide function that, as shown in FIG. 11, recommends and presents related words reflecting current trends as additional search terms to assist the user's search, which improves user convenience. Known ways of obtaining these related words, as shown in FIG. 12, include extracting them from the co-occurrence of search terms in the query log or from the click destination URLs that search terms have in common, and the method of Patent Document 1, which extracts them from the words that appear near the search term in the electronic documents (Web pages) ranked at the top of the search results for that search term.

Patent Document 1: JP 2006-139484 A

However, the space on a Web page for presenting additional search terms to the user is limited, and if related words are displayed in order of relevance, related words with the same meaning (for example, "桜" (cherry blossoms), "花見" (cherry-blossom viewing), and "さくら" (cherry blossoms, written in hiragana) for "Kyoto") may all line up at the top.

Moreover, because users have diverse intents, improving search convenience calls for recommending and presenting a broad range of related words with different intents, not only related words that reflect the same intent.

Yet semantically clustering a set of related words extracted by a conventional method such as that of Patent Document 1 requires holding semantic information for each word before the computation can be carried out. Because search terms span an enormous variety of words, holding semantic information for all of them is difficult.

The present invention has been made to solve the problems of the prior art described above, and its object is to extract diverse related words by analyzing less data and to present them to the user.

The present invention therefore focuses on the fact that search terms follow trends. It treats a period in which searches for a term surged, together with that search term, as one group (burst information), and obtains the related words for that period (for example, words searched together with the search term). For each such group, a related word group is extracted in response to a request from the search engine and presented to the user.

The related word extraction device according to the present invention comprises: detection means that analyzes a search log in arbitrary units and detects, for each analysis unit, search terms whose search count is increasing at a rate equal to or greater than a threshold; related word extraction means that extracts from the search log a group of related words for each search term detected by the detection means, groups the extracted related words by analysis unit, and stores the resulting groups in a database together with the search term; and related word output means that, in response to a request from a search engine, selects related words for a user-entered search term from the database in order and without duplication, and outputs the selected related words to the search engine.

The related word extraction method according to the present invention comprises: a detection step of analyzing a search log in arbitrary units and detecting, for each analysis unit, search terms whose search count is increasing at a rate equal to or greater than a threshold; a related word extraction step of extracting from the search log a group of related words for each search term detected in the detection step, grouping the extracted related words by analysis unit, and storing the resulting groups in a database together with the search term; and a related word output step of selecting, in response to a request from a search engine, related words for a user-entered search term from the database in order and without duplication, and outputting the selected related words to the search engine.

In each of the above aspects, the groups stored in the database, or the related words within the groups, may also be clustered. The clustering uses, for example, the similarity of the related words within the groups or click destination information.

The present invention may also take the form of a program that causes a computer to function as the extraction device. This program can be provided over a network or on a recording medium.

According to the present invention, diverse related words can be extracted by analyzing less data and presented to the user.

FIG. 1: Block diagram of the related word extraction device according to the first embodiment of the present invention.
FIG. 2: Explanatory diagram of the related word output means of the first embodiment.
FIG. 3: Explanatory diagram of extracting related words for each burst detection date.
FIG. 4: Conceptual illustration of FIG. 3.
FIG. 5: Example output data of the related word output means (keyword = Yakult).
FIG. 6: Example output data of the related word output means (keyword = Disney).
FIG. 7: Example output data of the related word output means (keyword = roll cake).
FIG. 8: Diagram showing another processing example of the related word extraction means.
FIG. 9: Block diagram of the related word extraction device according to the second embodiment of the present invention.
FIG. 10: Explanatory diagram of the related word output means of the second embodiment.
FIG. 11: Explanatory diagram of a search engine guide function.
FIG. 12: Diagram showing a conventional related word extraction method.

≪First Embodiment≫
A related word extraction device according to the first embodiment of the present invention will be described with reference to FIG. 1. This extraction device 1 extracts words related to search terms from search log DB 3 of search engine 2 and outputs the extracted related words in response to requests from search engine 2. Specifically, extraction device 1 is implemented on a computer and has ordinary computer hardware resources such as a CPU, memory (RAM), and a hard disk drive.

Through the cooperation of these hardware resources with software resources, extraction device 1 implements burst detection means 4, burst information DB 5, related word extraction means 6, burst related word information DB 7, and related word output means 8. DBs 3, 5, and 7 are assumed to be built in a storage device such as memory (RAM) or a hard disk drive. DB 3 records the search log obtained from the search box of search engine 2. Here the search log comprises a query log of the search terms used in searches and a click log of click destination information (URLs, page information, and the like) for the search results of those search terms.

Detection means 4 analyzes the query log data recorded in DB 3 in arbitrary date-and-time units. Here, as an example, it analyzes the query log data one day at a time and detects search terms whose search count has surged (burst detection step). A threshold is used to decide whether the search count has surged, that is, whether a burst has occurred. For example, when the search count has risen 3σ or more (σ being the standard deviation) above the moving average over the past several days (an arbitrary period), the day may be judged abnormal and detected as a burst. Detection means 4 then stores burst information, which pairs the burst detection date with the search term that burst, in DB 5.
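The 3σ rule described above can be sketched as follows. This is a minimal illustration, not the implementation of detection means 4: the function name `detect_bursts`, the 7-day window, the use of the population standard deviation, and the handling of a flat history (σ = 0 is skipped) are all assumptions made for the example.

```python
from statistics import mean, pstdev

def detect_bursts(daily_counts, window=7, k=3.0):
    """Return the indices of days whose search count is at least
    k standard deviations above the moving average of the preceding
    `window` days (window and k are illustrative defaults)."""
    bursts = []
    for day in range(window, len(daily_counts)):
        history = daily_counts[day - window:day]
        mu = mean(history)
        sigma = pstdev(history)
        # Assumption: a perfectly flat history (sigma == 0) is skipped
        # rather than treated as an automatic burst.
        if sigma > 0 and daily_counts[day] >= mu + k * sigma:
            bursts.append(day)
    return bursts

# Hypothetical daily search counts for one search term.
counts = [100, 104, 98, 101, 97, 103, 99, 102, 480, 110]
burst_days = detect_bursts(counts)
```

In this sketch each returned index would be paired with the search term and stored as burst information; the day after a burst is not flagged because the burst itself inflates the moving average and standard deviation.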

Extraction means 6 refers to the data stored in DB 5, obtains the search terms for each detection date, and analyzes the query log data to extract a group of related words for each search term (related word extraction step). That is, it analyzes only the query log data narrowed down by the date and search term held in DB 5, and extracts the related words of that search term. Related words may be extracted simply by the number of times a word co-occurs with the search term in the same query, or by an index like "tf·idf" obtained by dividing that co-occurrence count by the number of searches for the search term; alternatively, as in Patent Document 1, related words may be extracted by analyzing the top-ranked Web pages in the search results on the detection date.
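The co-occurrence approach can be sketched as below, assuming that each query log entry for the detection date is a whitespace-separated string. The function name `extract_related_words`, the tokenization, and the sample queries are hypothetical; the score is the co-occurrence count divided by the number of searches for the term, as the text describes.

```python
from collections import Counter

def extract_related_words(queries, search_term, top_n=3):
    """Score words that co-occur with `search_term` in the queries
    logged on one burst detection date."""
    cooccurrence = Counter()
    searches = 0
    for query in queries:
        words = query.split()
        if search_term not in words:
            continue  # only queries containing the burst term are analyzed
        searches += 1
        for word in set(words):
            if word != search_term:
                cooccurrence[word] += 1
    if searches == 0:
        return []
    # Score = co-occurrence count / number of searches for the term.
    scored = [(word, count / searches) for word, count in cooccurrence.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

# Hypothetical query log for one detection date.
log = ["yakult swallows", "yakult swallows victory", "yakult drink",
       "kyoto sakura", "yakult"]
related = extract_related_words(log, "yakult")
```

Restricting the scan to queries that contain the burst term on the detection date is what keeps the amount of analyzed data small.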

The related words extracted here are grouped into one related word group per burst detection date, and the related word groups are stored in DB 7 together with the search term. Specifically, each record in DB 7 has the form "search term, related word group (detection date: word|score, word|score, word|score), related word group (detection date: word|score, word|score, word|score), ...". The "score" recorded alongside each related word (word) is score information such as the related word's degree of relevance or its "tf·idf" value.
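One way to picture such a record is the sketch below. The class names `RelatedWordGroup` and `BurstRecord`, the date string, and the exact text layout produced by `format_record` are assumptions for illustration; the source specifies only that a record holds the search term followed by dated groups of word|score pairs.

```python
from dataclasses import dataclass, field

@dataclass
class RelatedWordGroup:
    detection_date: str   # burst detection date (format is an assumption)
    words: list           # list of (word, score) pairs

@dataclass
class BurstRecord:
    search_term: str
    groups: list = field(default_factory=list)

def format_record(record):
    """Render a record in a layout modelled on the one quoted in the text."""
    parts = [record.search_term]
    for group in record.groups:
        pairs = ",".join(f"{word}|{score}" for word, score in group.words)
        parts.append(f"({group.detection_date}: {pairs})")
    return ", ".join(parts)

# Hypothetical record with a single related word group.
record = BurstRecord("yakult", [
    RelatedWordGroup("11-03", [("swallows", 0.5), ("victory", 0.25)]),
])
line = format_record(record)
```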

When a user enters a search term in the search box of search engine 2 through a terminal (not shown; for example, a PC or mobile phone), output means 8 receives the user-entered search term as a request from the front end of search engine 2. It retrieves the record for the received search term from DB 7 and, as shown in FIG. 2, selects related words from the related word groups of that record in round-robin fashion without duplication.

An arbitrary number of the related words selected in this way are returned to the front end of search engine 2 and used by its guide function. That is, the related words that search engine 2 receives are listed, as recommended words to add to the user-entered search term, in the space for presenting additional terms, such as the search box.
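The round-robin selection without duplication can be sketched as follows, assuming each related word group is a list of words already ordered by descending score. The function name `select_round_robin` and the sample groups are hypothetical.

```python
def select_round_robin(groups, limit):
    """Take one not-yet-selected word from each group in turn until
    `limit` words are chosen or every group is exhausted."""
    selected = []
    seen = set()
    positions = [0] * len(groups)
    while len(selected) < limit:
        progressed = False
        for i, group in enumerate(groups):
            # Skip words that were already taken from another group.
            while positions[i] < len(group) and group[positions[i]] in seen:
                positions[i] += 1
            if positions[i] < len(group):
                word = group[positions[i]]
                positions[i] += 1
                seen.add(word)
                selected.append(word)
                progressed = True
                if len(selected) == limit:
                    break
        if not progressed:
            break  # all groups exhausted before reaching the limit
    return selected

# Hypothetical related word groups for one search term, one per burst.
groups = [
    ["swallows", "victory", "parade"],
    ["drink", "swallows", "calories"],
    ["victory", "stock"],
]
top5 = select_round_robin(groups, 5)
```

Because each pass takes at most one word per group, the first few recommended words come from different bursts, which is what spreads them across different search intents.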

In this way, related words can be extracted with search term trends (bursts) taken into account and presented to the user as recommended search words. The number of searches for a given search term is not constant and may surge temporarily (burst). Typical causes include coverage in media such as TV and newspapers, or the topic becoming popular in a Web community (a bulletin board, social media, and so on).

On a burst detection date, when searches for such a term have surged temporarily, the search term is skewed toward a particular search intent (that intent accounts for a large share of searches), so related words matching that intent can be extracted from the query log. Suppose, for example, that in the steady state rather than on a burst detection date, the shares of users of the search term "Yakult" who are looking for information on ("beverage", "health food", "Yakult baseball team") are ("40%", "30%", "30%") respectively. If Yakult the baseball team (the Yakult Swallows) wins the championship on November 3 and searches for "Yakult" surge (burst), those shares may become ("5%", "5%", "90%"). The related word group extracted from the query log at that time will contain many words related to the Yakult baseball team.

Therefore, creating a related word group for each burst detection date makes it possible to extract related words for diverse search intents and present them to the user, improving search convenience. Referring to FIG. 3, for the same search term, the related word group of burst 1 contains many words related to intent B, so words related to intent B are readily extracted from it, whereas the related word group of burst 2 contains many words related to intent A, so words related to intent A are readily extracted from it. That is, as shown in FIG. 4, in the related word groups of bursts 1 and 2 the shares of words a and b are each "0.3" higher than in the steady state (the average over the week before the burst). Word b extracted from burst 1 is then likely to reflect intent B, and word a extracted from burst 2 is likely to reflect intent A. Presenting the related words extracted from each related word group therefore presents the user with recommended words for both intents A and B.

In FIGS. 5 to 7, "keyword" denotes the search term detected by detection means 4, "overall" denotes the related words presented by a conventional guide function, and "burst considered" denotes the related words presented by a guide function using extraction device 1. The underlined entries are related words that the conventional guide function does not present, which shows that extraction device 1 makes it possible to present a diverse set of related words to the user.

Because extraction device 1 extracts related words using only the query log data for burst detection dates, diverse related words can be extracted by analyzing less data, which also makes processing more efficient. In addition, because a related word group is created for each burst and its related words are presented as recommended words, users can easily search for past trends and events, which further supports their search activity.

FIG. 8 shows another processing example of extraction means 6. Here the effect is extended further by computing, between the burst detection date and the steady state, the difference in each related word's score information, that is, in the "score" value of each "word" stored in DB 7.

Specifically, an arbitrary period before the burst detection date (one week here) is taken as the steady state, and the average steady-state score of each related word, such as its degree of relevance or "tf·idf", is subtracted from the score information of the same related word on the burst detection date, that is, from the "score" value stored in DB 7. This lowers the scores of related words that already scored highly in the steady state, so that related words scoring highly only on the burst detection date rise to the top.

For example, the related word "Tokyo Disney Resort" in FIG. 8 has the highest "tf·idf" value, "0.15", in the steady state but the lowest value, "0.05", on the burst detection date of the search term "Disney (registered trademark)", so it is extracted at a low rank in the related word group. By contrast, the related word "Disney room lamp" has the lowest "tf·idf" value, "0.02", in the steady state but the second-highest value, "0.14", on the burst date of the search term "Disney", so it is extracted in second place.
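The subtraction can be sketched with the two score pairs quoted above. The function name `rerank_by_difference` and the treatment of words absent from the steady state (assumed score 0.0) are assumptions; the other related words in FIG. 8 are not reproduced here, so the resulting order covers only these two words.

```python
def rerank_by_difference(burst_scores, steady_scores):
    """Subtract each word's steady-state average score from its score on
    the burst detection date and rank by the difference, highest first."""
    adjusted = {
        # Assumption: a word unseen in the steady state has score 0.0.
        word: score - steady_scores.get(word, 0.0)
        for word, score in burst_scores.items()
    }
    return sorted(adjusted.items(), key=lambda pair: (-pair[1], pair[0]))

# The two values quoted in the text for the search term "Disney".
steady = {"Tokyo Disney Resort": 0.15, "Disney room lamp": 0.02}
burst = {"Tokyo Disney Resort": 0.05, "Disney room lamp": 0.14}
ranking = rerank_by_difference(burst, steady)
```

With these figures "Disney room lamp" gains about 0.12 while "Tokyo Disney Resort" loses about 0.10, so the burst-specific word ranks above the word that is always popular.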

≪Second Embodiment≫
FIG. 9 shows a related word extraction device according to the second embodiment of the present invention. This extraction device 11 is provided with clustering means 9, which clusters the data stored in DB 7. Here the related word groups for a search term that burst are regrouped.

The related word groups stored in DB 7 for a search term may include groups that contain similar words with the same intent. For example, the search term "Kyoto" bursts (trends) every spring, and on those burst detection dates the same related words, such as "cherry blossoms", "cherry-blossom viewing", and "Yoshino", are extracted each time. If these are treated as separate groups and related words are selected from them in round-robin fashion, many of the selected words share the same intent.

Because the space for presenting additional terms, such as the search box, is limited, related words with the same meaning may then fill the top positions and prevent a wide variety of related words from being presented to the user. Clustering means 9 is therefore added to extraction device 1 to cluster the related word groups stored in DB 7 and merge similar groups into one.

Clustering means 9 retrieves a record from DB 7 and clusters the related word groups in that record according to the similarity of the related words they contain (clustering step). The similarity judgment uses the kinds of related words and the "score" information recorded in DB 7 alongside each related word. Any clustering technique may be used, such as the representative Ward's method or k-means, and the number of clusters to create may be specified arbitrarily.

For example, if "cherry blossoms" and "cherry-blossom viewing" were stored as the related word group for a burst of the search term "Kyoto" detected on April xx, 20XX, and "cherry blossoms" and "cherry-blossom viewing" were also stored as the related word group for April yy, 200Y, the two groups can be regarded as reflecting the same intent. The kinds of related words ("cherry blossoms", "cherry-blossom viewing") and their degrees of relevance (tf·idf and the like) are used as score information, and the related word groups are clustered according to, for example, the tendencies of their distributions.
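Grouping related word groups by the similarity of their word and score distributions can be sketched as below. Note that this is a simple greedy threshold clustering on cosine similarity, swapped in to keep the example short; it is neither Ward's method nor k-means. The function names, the 0.5 threshold, and the sample scores are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two {word: score} mappings."""
    dot = sum(score * b.get(word, 0.0) for word, score in a.items())
    norm_a = math.sqrt(sum(s * s for s in a.values()))
    norm_b = math.sqrt(sum(s * s for s in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_groups(groups, threshold=0.5):
    """Greedy clustering: each group joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster.
    Returns lists of group indices."""
    clusters = []
    for index, group in enumerate(groups):
        for cluster in clusters:
            if cosine(groups[cluster[0]], group) >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters

# Hypothetical related word groups for "Kyoto" (sakura = cherry blossoms,
# hanami = cherry-blossom viewing, momiji = autumn leaves).
kyoto_groups = [
    {"sakura": 0.4, "hanami": 0.3},                  # a spring burst
    {"momiji": 0.5, "temple": 0.2},                  # an autumn burst
    {"sakura": 0.35, "hanami": 0.3, "yoshino": 0.1}, # a later spring burst
]
clusters = cluster_groups(kyoto_groups)
```

The two spring groups share most of their words and end up in one cluster, while the autumn group, which shares none, stays separate.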

Likewise, when the Yakult baseball team (the Yakult Swallows) wins the championship as described above and the search term "Yakult" first bursts, if related word groups with the same intent continue on the days after the burst detection date, those related word groups are clustered in the same way.

The related word groups of the records in each cluster are then merged, and a list of related words in descending order of relevance is created for each cluster. If a related word appears in more than one of the groups being merged, its relevance may be taken as the sum of its scores, or as the maximum, median, or mean, or determined by some other method.
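Merging the groups of one cluster with a selectable rule for duplicated words can be sketched as follows. The function name `merge_cluster`, the `how` parameter, and the sample scores are hypothetical; the four rules correspond to the sum, maximum, median, and mean mentioned in the text.

```python
from statistics import mean, median

# Rules for combining the scores of a word that appears in several groups.
AGGREGATORS = {"sum": sum, "max": max, "median": median, "mean": mean}

def merge_cluster(groups, how="sum"):
    """Merge the {word: score} groups of one cluster into a single list
    of (word, score) pairs sorted by descending combined score."""
    collected = {}
    for group in groups:
        for word, score in group.items():
            collected.setdefault(word, []).append(score)
    combine = AGGREGATORS[how]
    merged = {word: combine(scores) for word, scores in collected.items()}
    return sorted(merged.items(), key=lambda pair: (-pair[1], pair[0]))

# Hypothetical groups that were assigned to the same cluster.
spring_groups = [
    {"sakura": 0.5, "hanami": 0.25},
    {"sakura": 0.25, "yoshino": 0.125},
]
merged_sum = merge_cluster(spring_groups, "sum")
merged_max = merge_cluster(spring_groups, "max")
```

Summing rewards words that recur across bursts, whereas the maximum keeps a word's strongest single showing; which rule suits a deployment is left open by the text.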

Referring to FIG. 10, G1 to G4 denote records in DB 7, that is, the related word groups of a given search term. Here the related words in G1 and G2 both reflect search intent A (yy/mm/dd), so their score distributions and the like are similar, and as a result of clustering G1 and G2 are merged.

Since clustering rebuilds the related word groups cluster by cluster, the corresponding record in DB 7 is updated and saved. From the updated DB 7, output means 8 selects related words in order, round-robin and without duplication, and returns an arbitrary number of them to the front end of search engine 2.

Because related word groups with the same intent are merged and rebuilt in this way, the related word groups created per burst detection date do not grow too numerous, and many groups with the same intent do not accumulate. The related words can therefore be displayed effectively in the space for additional terms, such as the search box, and are easier to use for display in the search engine.

Moreover, since the rebuilt related word groups each gather related words with the same intent, displaying related words from each group as recommended words in the search box or the like presents the user with recommended words for diverse search intents. Given that space such as the search box is limited, when the number of recommended words exceeds a threshold (an arbitrary number), a representative related word for each related word group (each cluster) may be displayed as the recommended word.
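The fallback to one representative word per cluster can be sketched as below. Choosing the highest-scoring word as the representative is an assumption, as are the function name `pick_recommendations` and the sample clusters; the text does not say how the representative is chosen.

```python
def pick_recommendations(clusters, limit):
    """Return every related word if they fit within `limit`; otherwise
    return one representative (highest-scoring) word per cluster."""
    all_words = list(dict.fromkeys(
        word for cluster in clusters for word, _ in cluster))
    if len(all_words) <= limit:
        return all_words
    # Assumption: the representative is the top-scoring word of the cluster.
    representatives = [max(cluster, key=lambda pair: pair[1])[0]
                       for cluster in clusters if cluster]
    return list(dict.fromkeys(representatives))[:limit]

# Hypothetical clusters of (word, score) pairs for one search term.
word_clusters = [
    [("swallows", 0.5), ("victory", 0.25)],
    [("drink", 0.4), ("calories", 0.1)],
]
shown = pick_recommendations(word_clusters, 3)
```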

(1) Other processing example 1
Clustering means 9 can also cluster individual related words rather than the related word groups stored in DB 7. This clustering uses the click log stored in DB 3.

Specifically, clustering means 9 extracts each record (a search term and its related word groups) from DB 7. For each extracted record, it obtains from the click log in DB 3 the click destination information for pages clicked from both the search term and the related word, and clusters the related words by that click destination information. The click destination information may be, for example, the host name or path name of the click destination URL, or page information for the click destination URL (such as frequently occurring words). Related words that share this information are placed in the same cluster, and a list of related words in order of "score" value is created for each cluster.
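Clustering related words by the host name of their click destinations can be sketched as follows. The sketch simplifies the description: it keys each word on its most frequently clicked host and does not check that the page was also clicked from the search term itself. The function name, the data shapes, and the example.com URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def cluster_words_by_host(word_scores, clicks):
    """word_scores: {word: score}; clicks: {word: [clicked URLs]}.
    Returns {host: [(word, score), ...]} with each list sorted by
    descending score."""
    by_host = defaultdict(list)
    for word, score in word_scores.items():
        hosts = [urlparse(url).hostname for url in clicks.get(word, [])]
        hosts = [host for host in hosts if host]
        # Assumption: the most frequently clicked host represents the
        # word (ties broken alphabetically); no clicks maps to None.
        key = max(sorted(set(hosts)), key=hosts.count) if hosts else None
        by_host[key].append((word, score))
    return {host: sorted(items, key=lambda pair: (-pair[1], pair[0]))
            for host, items in by_host.items()}

# Hypothetical scores and click log entries.
scores = {"swallows": 0.5, "victory": 0.25, "drink": 0.125}
click_log = {
    "swallows": ["https://baseball.example.com/team",
                 "https://baseball.example.com/news"],
    "victory": ["https://baseball.example.com/parade"],
    "drink": ["https://beverage.example.com/products"],
}
word_clusters_by_host = cluster_words_by_host(scores, click_log)
```

Because the grouping relies on click behaviour rather than word meaning, no semantic dictionary of the related words is needed.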

As a result of this clustering, the related word groups are rebuilt cluster by cluster, the record in DB 7 is updated to hold the search term and its clusters, and the same effects as described above are obtained. Here too, when the number of recommended words exceeds the threshold (an arbitrary number), a representative related word for each related word group can be displayed as the recommended word.

(2) Other processing example 2
Clustering means 9 can also cluster the related word groups in DB 7 by the click destination information of each related word group rather than by the similarity of the related words.

Here, the click destination information for pages clicked from each related word in a related word group is obtained from the click log in DB 3. The related word groups may be clustered using all of the click destination information obtained, or using only the click destination information that represents each related word group. Representative click destinations can be chosen, for example, as those with the highest click counts (at or above a predetermined rank).

As in processing example 1, the click destination information can be the host name or path name of the click destination URL, or page information of the click destination URL (for example, frequently occurring words). Related word groups that share this information are integrated, and a list of related words is created for each cluster in descending order of relevance.

As a result of the clustering, the related word groups are reconstructed cluster by cluster, the record in the DB 7 is updated with the reconstructed related word groups (clusters), and the same effect as described above is obtained. Here too, when the number of recommended words exceeds a threshold (an arbitrary number), a representative related word can be displayed as the recommended word for each related word group.
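The group-level integration of processing example 2 might look like the sketch below. It assumes that each related word group comes with per-host click counts, that the representative click destinations are the `top_k` most-clicked hosts, and that two groups are integrated when their representative hosts overlap. All names and the input layout are hypothetical, and ordering the merged list by relevance is omitted for brevity.

```python
from collections import Counter


def representative_hosts(click_counts, top_k=1):
    """Return the top_k most-clicked hosts of one group as a set."""
    return {host for host, _ in Counter(click_counts).most_common(top_k)}


def merge_groups(groups, top_k=1):
    """Merge related word groups whose representative click hosts overlap.

    groups: list of (related_words, {host: click_count}) pairs.
    Returns merged word lists; words keep first-seen order, without duplicates.
    """
    clusters = []  # each cluster: (ordered word list, set of representative hosts)
    for words, click_counts in groups:
        new_words = list(words)
        new_hosts = representative_hosts(click_counts, top_k)
        kept = []
        for old_words, old_hosts in clusters:
            if old_hosts & new_hosts:
                # Shared click destination: absorb the earlier cluster.
                new_words = old_words + [w for w in new_words if w not in old_words]
                new_hosts = old_hosts | new_hosts
            else:
                kept.append((old_words, old_hosts))
        clusters = kept + [(new_words, new_hosts)]
    return [words for words, _ in clusters]
```

Using all click destinations instead of representatives corresponds to passing a `top_k` at least as large as the number of distinct hosts in any group.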

≪Programs≫
The present invention can also be configured as a related word extraction program that causes a computer to function as some or all of the means 4 to 9 of the extraction devices 1, 11. With this related word extraction program, a computer can be made to execute some or all of the steps described above.

The program can be provided over a network, for example through a website or e-mail. The program can also be recorded on a recording medium such as a CD-ROM, DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, MO, HDD, BD-ROM, BD-R, or BD-RE for storage and distribution. The recording medium is read using a recording medium drive, and the program code itself realizes the processing of the above embodiments, so the recording medium also constitutes the present invention.

1, 11 ... Related word extraction device
2 ... Search engine
3 ... Search log DB (database)
4 ... Burst detection means (detection means)
5 ... Burst information DB (database)
6 ... Related word extraction means
7 ... Burst related word information DB (database)
8 ... Related word output means
9 ... Clustering means

Claims

A related word extraction device that extracts, in advance, related words of search words based on a search log of a search engine and returns related words of a user-input search word in response to a request from the search engine, the device comprising:
detection means for analyzing the search log in arbitrary units and detecting, for each analysis unit, a search word whose number of searches increases at a rate equal to or greater than a threshold;
related word extraction means for extracting from the search log a related word group related to the search word detected by the detection means, grouping the extracted related words for each analysis unit, and storing the related word groups in a database together with the search word; and
related word output means for sequentially selecting related words of the user-input search word from the database in response to a request from the search engine, and outputting the selected related words to the search engine.

A related word extraction device that extracts, in advance, related words of search words based on a search log of a search engine and returns related words of a user-input search word in response to a request from the search engine, the device comprising:
detection means for analyzing the search log in arbitrary units and detecting, for each analysis unit, a search word whose number of searches increases at a rate equal to or greater than a threshold;
related word extraction means for extracting from the search log a related word group related to the search word detected by the detection means, grouping the extracted related words for each analysis unit, and storing the related word groups in a database together with the search word;
clustering means for clustering the groups stored in the database by the similarity of the related words in each group, integrating the groups, and updating the database; and
related word output means for sequentially selecting related words of the user-input search word from the database updated by the clustering means in response to a request from the search engine, and outputting the selected related words to the search engine.

A related word extraction device that extracts, in advance, related words of search words based on a search log of a search engine and returns related words of a user-input search word in response to a request from the search engine, the device comprising:
detection means for analyzing the search log in arbitrary units and detecting, for each analysis unit, a search word whose number of searches increases at a rate equal to or greater than a threshold;
related word extraction means for extracting from the search log a related word group related to the search word detected by the detection means, grouping the extracted related words for each analysis unit, and storing the related word groups in a database together with the search word;
clustering means for extracting, from a click log in the search log, information on the click destinations clicked from the search word and the related words in the database, clustering the related words in the database according to the extracted click destination information, and updating the database; and
related word output means for sequentially selecting related words of the user-input search word from the database updated by the clustering means in response to a request from the search engine, and outputting the selected related words to the search engine.

A related word extraction device that extracts, in advance, related words of search words based on a search log of a search engine and returns related words of a user-input search word in response to a request from the search engine, the device comprising:
detection means for analyzing the search log in arbitrary units and detecting, for each analysis unit, a search word whose number of searches increases at a rate equal to or greater than a threshold;
related word extraction means for extracting from the search log a related word group related to the search word detected by the detection means, grouping the extracted related words for each analysis unit, and storing the related word groups in a database together with the search word;
clustering means for extracting, from a click log in the search log, information on the click destinations clicked from each related word in the groups stored in the database, clustering and integrating the groups by the extracted click destination information, and updating the records of the database; and
related word output means for sequentially selecting related words of the user-input search word from the database updated by the clustering means in response to a request from the search engine, and outputting the selected related words to the search engine.

A related word extraction method executed by a device that extracts, in advance, related words of search words based on a search log of a search engine and returns related words of a user-input search word in response to a request from the search engine, the method comprising:
a detection step of analyzing the search log in arbitrary units and detecting, for each analysis unit, a search word whose number of searches increases at a rate equal to or greater than a threshold;
a related word extraction step of extracting from the search log a related word group related to the search word detected in the detection step, grouping the extracted related words for each analysis unit, and storing the related word groups in a database together with the search word; and
a related word output step of selecting, in order and without duplication, related words of the user-input search word from the database in response to a request from the search engine, and outputting the selected related words to the search engine.

A related word extraction method executed by a device that extracts, in advance, related words of search words based on a search log of a search engine and returns related words of a user-input search word in response to a request from the search engine, the method comprising:
a detection step of analyzing the search log in arbitrary units and detecting, for each analysis unit, a search word whose number of searches increases at a rate equal to or greater than a threshold;
a related word extraction step of extracting from the search log a related word group related to the search word detected in the detection step, grouping the extracted related words for each analysis unit, and storing the related word groups in a database together with the search word;
a clustering step of clustering the groups stored in the database by the similarity of the related words in each group, integrating the groups, and updating the database; and
a related word output step of sequentially selecting related words of the user-input search word from the database updated in the clustering step in response to a request from the search engine, and outputting the selected related words to the search engine.

A related word extraction method executed by a device that extracts, in advance, related words of search words based on a search log of a search engine and returns related words of a user-input search word in response to a request from the search engine, the method comprising:
a detection step of analyzing the search log in arbitrary units and detecting, for each analysis unit, a search word whose number of searches increases at a rate equal to or greater than a threshold;
a related word extraction step of extracting from the search log a related word group related to the search word detected in the detection step, grouping the extracted related words for each analysis unit, and storing the related word groups in a database together with the search word;
a clustering step of extracting, from a click log in the search log, information on the click destinations clicked from the search word and the related words in the database, clustering the related words in the database according to the extracted click destination information, and updating the database; and
a related word output step of sequentially selecting related words of the user-input search word from the database updated in the clustering step in response to a request from the search engine, and outputting the selected related words to the search engine.

A related word extraction method executed by a device that extracts, in advance, related words of search words based on a search log of a search engine and returns related words of a user-input search word in response to a request from the search engine, the method comprising:
a detection step of analyzing the search log in arbitrary units and detecting, for each analysis unit, a search word whose number of searches increases at a rate equal to or greater than a threshold;
a related word extraction step of extracting from the search log a related word group related to the search word detected in the detection step, grouping the extracted related words for each analysis unit, and storing the related word groups in a database together with the search word;
a clustering step of extracting, from a click log in the search log, information on the click destinations clicked from each related word in the groups stored in the database, clustering and integrating the groups by the extracted click destination information, and updating the records of the database; and
a related word output step of selecting, in order and without duplication, related words of the user-input search word from the database updated in the clustering step in response to a request from the search engine, and outputting the selected related words to the search engine.

A related word extraction program that causes a computer to function as each means of the related word extraction device according to any one of claims 1 to 4.