JP5801242B2

JP5801242B2 - Estimated interest score database generation apparatus, method, and program

Info

Publication number: JP5801242B2
Application number: JP2012086893A
Authority: JP
Inventors: 良彦数原; 尚樹藤田; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-04-05
Filing date: 2012-04-05
Publication date: 2015-10-28
Anticipated expiration: 2032-04-05
Also published as: JP2013218440A

Description

The present invention relates to an estimated interest score database generation apparatus, method, and program in the field of information retrieval. In particular, it relates to such an apparatus, method, and program for a document search service in which a searcher specifies a geographic range by a map display range or by latitude/longitude information, and which supports the search by displaying, before the search is run, keywords that are characteristic under the specified conditions.

Conventionally, there is a method of identifying place name information by analyzing the text in a document (see, for example, Non-Patent Document 1). The result of this method makes it possible to analyze which region each document relates to. With it, one can analyze in which regions the documents containing each keyword of a previously prepared keyword set appear, and it is considered possible to extract keywords to recommend for a specific region. As a specific technique, the region is divided at fixed intervals, such as 200 m in each of the east-west and north-south directions or every 8 seconds of latitude and longitude (each division is hereinafter called a "mesh"), and the frequency of each keyword in the document set related to each mesh is analyzed. If, in an area containing a plurality of meshes, the frequency of a keyword is characteristically high within the overall frequency distribution, the keyword can be judged to be one that should be recommended in that area. One way to decide whether a frequency is characteristic is to judge it so when the appearance frequency in the mesh is higher than the average appearance frequency over all meshes by 3σ or more (σ is the standard deviation). Alternatively, a Poisson probability can be used.
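The 3σ criterion described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `characteristic_meshes`, the dictionary layout (mesh ID to keyword frequency), and the use of the population standard deviation are hypothetical choices and are not taken from the cited method.

```python
from statistics import mean, pstdev


def characteristic_meshes(freq_by_mesh, n_sigma=3.0):
    """Return the mesh IDs in which a keyword's frequency is at least
    n_sigma standard deviations above its mean frequency over all meshes.

    freq_by_mesh: dict mapping mesh ID -> frequency of one keyword
    (hypothetical layout chosen for this sketch).
    """
    counts = list(freq_by_mesh.values())
    mu = mean(counts)
    sigma = pstdev(counts)  # assumption: population standard deviation
    if sigma == 0:
        return []  # every mesh has the same frequency; nothing stands out
    threshold = mu + n_sigma * sigma
    return [m for m, c in freq_by_mesh.items() if c >= threshold]
```

Note that with the population standard deviation a single mesh can exceed 3σ only when at least ten meshes are compared, so the criterion is meaningful only over a reasonably large set of meshes.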

The document-analysis-based method described above does not take into account how much interest users have in the keywords displayed for an area. When a history of users selecting keywords presented in a given area is available, it can be used to recommend keywords in a way that reflects user interest. For example, as shown in FIG. 1, by regarding each click on a keyword as an interest count for the viewing range in which the keyword was shown, a large keyword selection log can be used as a histogram over two-dimensional coordinates. Using this histogram, keyword recommendation that reflects the interests of past users can be realized for the user's current viewing range.

Toru Hirano, Yoshihiro Matsuo, Genichiro Kikui, "Resolving ambiguity of place names using geographical distance", 70th National Convention of the Information Processing Society of Japan, 2008.

However, as shown in FIGS. 2 and 3, users browse at various scales when using the system and select keywords at those scales. If viewing ranges of different scales are treated uniformly as they are, the range of user interest in a keyword cannot be appropriately extracted from the history (Problem 1).

In addition, when there is little keyword selection history in a given area, discontinuous portions arise in the histogram frequency information, as shown in FIG. 4, and the area in which users are interested cannot be estimated appropriately (Problem 2).

Furthermore, there may be several areas in which users are interested for a given keyword, but because this number differs from keyword to keyword, an appropriate number of interest distributions cannot be determined (Problem 3).

Because of these three problems, when the user selection history for keywords is used, keywords cannot be recommended in a way that appropriately reflects the degree of user interest.

The present invention has been made in view of the above points, and its object is to provide an estimated interest score database generation apparatus, method, and program capable of generating an estimated interest score database while taking into account the influence of each click log entry according to the user's viewing range.

To solve the above problems, the present invention (claim 1) is an estimated interest score database generation apparatus that estimates the degree of interest of users in a document search service in which a geographic range is searched based on a map display range or latitude/longitude information, the apparatus comprising:
click frequency histogram storage means that stores, for each keyword, mesh IDs of rectangles obtained by dividing a map and a click frequency for each mesh ID;
interest distribution calculation means that acquires the mesh IDs and click frequencies of a keyword from the click frequency histogram storage means, estimates two-dimensional normal distributions to obtain the two-dimensional normal distribution parameters for the keyword and the mixture ratio of each two-dimensional normal distribution for the keyword, and stores them in interest distribution storage means; and
interest score database generation means that, based on the two-dimensional normal distribution parameters acquired from the interest distribution storage means, calculates the integral of the probability density corresponding to each mesh ID and adds it to the record of the keyword corresponding to that mesh ID in an interest score database, thereby obtaining the interest distribution.

The present invention (claim 2) further comprises:
keyword click log storage means that stores keywords and the geographic range in which each keyword was clicked; and
click frequency histogram generation means that acquires a keyword and its geographic range from the keyword click log storage means, calculates a score according to the size of the geographic range, obtains the set of mesh IDs corresponding to the geographic range, and, for each mesh ID in that set, adds the score to the click frequency column of the record in the click frequency histogram storage means whose keyword and mesh ID match.

In the present invention (claim 3), the interest distribution calculation means includes means for converting the mesh IDs read from the click frequency histogram storage means into point information on two-dimensional coordinates, adding the point information to a data point set, obtaining two-dimensional normal distribution parameters in two-dimensional space based on the point set on two-dimensional coordinates formed by the data point set, obtaining the number of mixture components of the two-dimensional normal distributions and the mixture ratio of each distribution, and storing the two-dimensional normal distribution parameters for that number of components and the mixture ratios in the interest distribution storage means.

As described above, according to the present invention, the influence of each click log entry can be weighted according to the user's viewing range, which reduces the adverse effect of clicks made while viewing a wide range. In addition, estimating two-dimensional normal distributions from the data eliminates the discontinuities that occur when there are only a few click log entries for a keyword. The number of user interest distributions for a keyword can also be estimated appropriately from the data. As a result, the degree of user interest can be estimated with high accuracy from the user selection history for keywords.

FIG. 1 shows the click frequency of a keyword in viewing ranges.
FIG. 2 is an example (part 1) of keyword selection in viewing ranges of different scales.
FIG. 3 is an example (part 2) of keyword selection in viewing ranges of different scales.
FIG. 4 is an example of a sparse keyword selection history (discontinuous).
FIG. 5 is a configuration example of the estimated interest score DB generation apparatus in one embodiment of the present invention.
FIG. 6 is a data example of the keyword click log DB in one embodiment of the present invention.
FIG. 7 is a data example of the click frequency histogram DB in one embodiment of the present invention.
FIG. 8 is an example of addition to mesh IDs by a click in one embodiment of the present invention.
FIG. 9 is a flowchart of the processing of the click frequency histogram generation unit in one embodiment of the present invention.
FIG. 10 is a flowchart of the processing of the interest distribution calculation unit in one embodiment of the present invention.
FIG. 11 is a data example of the interest distribution DB in one embodiment of the present invention.
FIG. 12 is an example of the interest score DB in one embodiment of the present invention.
FIG. 13 is a flowchart of the processing of the interest score DB generation unit in one embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 5 shows a configuration example of the estimated interest score DB generation apparatus in one embodiment of the present invention.

The estimated interest score DB generation apparatus shown in the figure comprises a keyword click log DB 10, a click frequency histogram generation unit 20, a click frequency histogram DB 30, an interest score DB generation processing unit 100, and an interest score DB 70.

As shown in FIG. 6, the keyword click log DB 10 holds information on the geographic range in which each keyword was clicked, and consists of a keyword column and a geographic range column. The geographic range can be, for example, the rectangular range of the screen in which the user was viewing the keyword; a rectangular range can be expressed by the coordinates of two points, the lower left and the upper right.

The click frequency histogram generation unit 20 reads the keyword click log DB 10 and writes its output to the click frequency histogram DB 30.

The click frequency histogram DB 30 handles geographic information in units of meshes. Here, a mesh is, for example, one of the 200 m square rectangles into which the map is divided. The method for converting given geographic range information into the corresponding mesh ID (or set of mesh IDs) is assumed to be given. A data example of the click frequency histogram DB 30 is shown in FIG. 7. Each record represents how much click frequency a given keyword has acquired in a given mesh ID. Specifically, as shown in FIG. 8, when a keyword displayed over certain meshes is clicked, the click is counted for the mesh ID of each of those meshes.
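The text treats the conversion from a geographic range to mesh IDs as given. Purely as an illustration, one possible conversion is sketched below. Everything in it is an assumption made for this sketch: the mesh ID is taken to be a (column, row) pair of integers, coordinates are in arc seconds, and one mesh spans 8 arc seconds (the 200 m mesh divided by the 25 m per second used elsewhere in the description). The names `mesh_id` and `mesh_ids_in_range` are hypothetical.

```python
import math

SECONDS_PER_MESH = 8.0  # assumption: 200 m mesh side / 25 m per arc second


def mesh_id(lon_sec, lat_sec, size=SECONDS_PER_MESH):
    """Map a point (in arc seconds) to a hypothetical (column, row) mesh ID."""
    return (math.floor(lon_sec / size), math.floor(lat_sec / size))


def mesh_ids_in_range(b, size=SECONDS_PER_MESH):
    """Return the set of mesh IDs overlapped by rectangle b.

    b = (lon1, lat1, lon2, lat2) in arc seconds. Each mesh is treated as a
    half-open cell, so a rectangle ending exactly on a mesh boundary does not
    spill into the next mesh. A degenerate (point) range yields one mesh.
    """
    lon1, lat1, lon2, lat2 = b
    x0 = math.floor(min(lon1, lon2) / size)
    y0 = math.floor(min(lat1, lat2) / size)
    x1 = max(math.ceil(max(lon1, lon2) / size), x0 + 1)
    y1 = max(math.ceil(max(lat1, lat2) / size), y0 + 1)
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}
```

A real service would more likely use an established grid code; the integer pair is used here only because it keeps the arithmetic visible.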

FIG. 9 shows the processing flow of the click frequency histogram generation unit in one embodiment of the present invention.

Step 101) The click frequency histogram generation unit 20 acquires an unprocessed record from the keyword click log DB 10, and sets the keyword as w and the geographic range as b.

Step 102) An addition score s is calculated according to the size of the geographic range b. For example, the reciprocal of the area of b is used to calculate s. When b = (lon₁, lat₁, lon₂, lat₂) consists of two pairs of longitude/latitude coordinates (in seconds), then, for example, in the Tokyo Datum one second is converted to 25 m, and the area-based score s is calculated by an expression that is not reproduced here, in which mesh_side denotes the length of one side of a mesh (in meters) and acts as a normalization term that makes the score of the smallest range equal to 1.

Step 103) The set of mesh IDs corresponding to the geographic range b is obtained and set as M.

Step 104) For each m in M, the score s calculated in Step 102 is added to the click frequency column of the record in the click frequency histogram DB 30 whose keyword is w and whose mesh ID is m.

Step 105) If an unprocessed record remains in the keyword click log DB 10, the process returns to Step 101; otherwise, the process ends.
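Steps 101 to 105 can be sketched as follows. The score expression itself does not survive in this text, so `area_score` is a reconstruction from the description only (reciprocal of the area, 25 m per second, normalized by mesh_side so that a range of exactly one mesh scores 1) and should be read as an assumption rather than the patent's exact formula. The in-memory dictionary standing in for the click frequency histogram DB 30, the (column, row) mesh IDs, and all function names are likewise hypothetical.

```python
import math
from collections import defaultdict

METERS_PER_ARC_SECOND = 25.0  # conversion stated in the description
MESH_SIDE_M = 200.0           # example mesh side from the description


def area_score(b, mesh_side=MESH_SIDE_M):
    """Assumed form of the Step 102 score: mesh_side^2 / area of b (m^2)."""
    lon1, lat1, lon2, lat2 = b
    width_m = abs(lon2 - lon1) * METERS_PER_ARC_SECOND
    height_m = abs(lat2 - lat1) * METERS_PER_ARC_SECOND
    area = width_m * height_m
    if area <= 0:
        raise ValueError("geographic range must have a positive area")
    return mesh_side ** 2 / area


def covered_meshes(b, mesh_side=MESH_SIDE_M):
    """Step 103: hypothetical (column, row) mesh IDs overlapped by b."""
    size = mesh_side / METERS_PER_ARC_SECOND  # mesh side in arc seconds
    lon1, lat1, lon2, lat2 = b
    x0 = math.floor(min(lon1, lon2) / size)
    y0 = math.floor(min(lat1, lat2) / size)
    x1 = max(math.ceil(max(lon1, lon2) / size), x0 + 1)
    y1 = max(math.ceil(max(lat1, lat2) / size), y0 + 1)
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


def build_click_histogram(click_log):
    """Steps 101-105: click_log is an iterable of (keyword, range) records."""
    hist = defaultdict(float)          # stands in for histogram DB 30
    for w, b in click_log:             # Step 101 / Step 105 loop
        s = area_score(b)              # Step 102
        for m in covered_meshes(b):    # Step 103
            hist[(w, m)] += s          # Step 104
    return dict(hist)
```

With this form of the score, a click made while viewing a 2 x 2 block of meshes contributes 0.25 to each of the four meshes, so every click distributes a total weight of about 1 regardless of zoom level.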

The interest score DB generation processing unit 100 reads the click frequency histogram DB 30 and outputs interest scores to the interest score DB 70. The interest score DB generation processing unit 100 comprises an interest distribution calculation unit 40, an interest distribution DB 50, and an interest score DB generation unit 60.

The interest distribution calculation unit 40 reads the click frequency histogram DB 30 and outputs interest distributions to the interest distribution DB 50. The interest distribution for a keyword consists of a plurality of normal distributions and their mixture ratios. The interest distribution DB 50 stores, for each keyword, a distribution ID, the information (parameters) of each distribution, and its mixture ratio. The distribution ID is the number of the component within the mixture, the mixture ratio is the ratio of that distribution for the keyword, and the parameters are those of the normal distribution (mean and variance-covariance).
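Written out, a record set for one keyword w in the interest distribution DB 50 describes a mixture density of the following form. The component superscript (i) and the symbol Σ for the variance-covariance matrix are notation introduced here for clarity; the description itself only names the mixture ratio π, the means μ₁, μ₂, and the entries σ₁₁, σ₁₂, σ₂₂.

```latex
p(x \mid w) = \sum_{i=1}^{k} \pi_i \,
  \mathcal{N}\!\left(x \,\middle|\, \mu^{(i)}, \Sigma^{(i)}\right),
\qquad \sum_{i=1}^{k} \pi_i = 1,
\qquad
\mu^{(i)} = \begin{pmatrix} \mu_1^{(i)} \\ \mu_2^{(i)} \end{pmatrix},
\quad
\Sigma^{(i)} = \begin{pmatrix}
  \sigma_{11}^{(i)} & \sigma_{12}^{(i)} \\
  \sigma_{12}^{(i)} & \sigma_{22}^{(i)}
\end{pmatrix}
```

Here x = (x₁, x₂) is a point on the two-dimensional coordinates and k is the number of mixture components for the keyword.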

FIG. 10 is a flowchart of the processing of the interest distribution calculation unit in one embodiment of the present invention.

Step 201) The interest distribution calculation unit 40 acquires an unprocessed keyword from the click frequency histogram DB 30 and sets it as w.

Step 202) From the records in the click frequency histogram DB 30 that correspond to keyword w, an unprocessed mesh ID and its click frequency are acquired and set as m and c, respectively.

Step 203) m is converted into point information on two-dimensional coordinates, and the coordinate values are set as (x₁, x₂).

Step 204) The calculated coordinate values (x₁, x₂) are added to the data set X.

Step 205) If an unprocessed mesh remains, the process returns to Step 202; otherwise, it proceeds to Step 206.

Step 206) Based on the point set on two-dimensional coordinates formed by the data point set, the parameters of a number of two-dimensional normal distributions suited to the data, together with their mixture ratios, are estimated in two-dimensional space. For this estimation, the method of C. E. Rasmussen, "The Infinite Gaussian Mixture Model", in Proceedings of Advances in Neural Information Processing Systems 12 (NIPS 1999), pp. 554-560, 1999, can be used, for example. With this method, the number k of mixture components suited to the data set X, the mixture ratio πᵢ of each distribution, and the parameters of each two-dimensional normal distribution (means μ₁, μ₂ and variance-covariance σ₁₁, σ₁₂, σ₂₂) can be estimated. (σ₂₁ is unnecessary because the covariance of a normal distribution is symmetric.)

Step 207) The information on each distribution obtained in Step 206 is output to the interest distribution DB 50. Because the number of mixture components differs from keyword to keyword, as many distribution records as the number k obtained in Step 206 are output to the interest distribution DB 50. An example of the output to the interest distribution DB 50 is shown in FIG. 11.

Step 208) If an unprocessed keyword remains in the click frequency histogram DB 30, the process returns to Step 201; otherwise, the process ends.
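The data preparation in Steps 202 to 205 and the shape of the output in Step 207 can be illustrated with a deliberately simplified sketch. It does not implement the infinite Gaussian mixture model cited in Step 206; instead it substitutes the single-component case (k = 1) and fits one two-dimensional normal distribution by weighted moments. Weighting each mesh point by its click frequency c is an assumption of this sketch (the description only says that the coordinates are added to X), as are the mesh-center conversion, the (column, row) mesh IDs, and the names `mesh_center` and `fit_single_normal`.

```python
def mesh_center(m, size=8.0):
    """Step 203 (assumed form): center of a hypothetical (column, row) mesh."""
    ix, iy = m
    return ((ix + 0.5) * size, (iy + 0.5) * size)


def fit_single_normal(records, size=8.0):
    """Fit one 2-D normal distribution to the meshes of a single keyword.

    records: iterable of (mesh_id, click_frequency) for the keyword.
    Returns a dict shaped like one row of the interest distribution DB:
    mixture ratio pi, mean (mu1, mu2), covariance (s11, s12, s22).
    This is the k = 1 simplification, not the infinite mixture model.
    """
    pts = [(mesh_center(m, size), c) for m, c in records if c > 0]
    total = sum(c for _, c in pts)
    if total <= 0:
        raise ValueError("no click frequency to fit")
    # Weighted means (assumption: weights are the click frequencies).
    mu1 = sum(c * x1 for (x1, _), c in pts) / total
    mu2 = sum(c * x2 for (_, x2), c in pts) / total
    # Weighted variance-covariance entries; s21 equals s12 by symmetry.
    s11 = sum(c * (x1 - mu1) ** 2 for (x1, _), c in pts) / total
    s12 = sum(c * (x1 - mu1) * (x2 - mu2) for (x1, x2), c in pts) / total
    s22 = sum(c * (x2 - mu2) ** 2 for (_, x2), c in pts) / total
    return {"pi": 1.0, "mu": (mu1, mu2), "cov": (s11, s12, s22)}
```

When the clicks for a keyword lie along a single line of meshes, the fitted covariance is singular, so a practical implementation would need a regularization term or a prior, which is one reason a Bayesian mixture model is attractive here.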

The interest score DB generation unit 60 reads the interest distribution DB 50 and outputs interest scores to the interest score DB 70.

As shown in FIG. 12, the interest score DB 70, like the click frequency histogram DB 30, stores information on how high an interest score a given keyword has in a given mesh.

FIG. 13 is a flowchart of the processing of the interest score DB generation unit in one embodiment of the present invention.

Step 301) The interest score DB generation unit 60 acquires an unprocessed keyword from the interest distribution DB 50 and sets it as w.

Step 302) Unprocessed distribution information corresponding to the keyword is acquired, with the mixture ratio set as π and the parameters of the normal distribution set as μ₁, μ₂, σ₁₁, σ₁₂, σ₂₂. For example, in the example of FIG. 11, for distribution ID 1, the parameters μ₁ = 1112773, μ₂ = 112353, σ₁₁ = 1233, σ₁₂ = 12453, σ₂₂ = 10224 are acquired.

Step 303) For the normal distribution expressed by these parameters, the set of mesh IDs contained inside the two-sided q-percent points is acquired, and the integral of the probability density corresponding to each mesh ID is calculated. Here, the two-sided q-percent points are the points within which the integral of the probability density is q percent of the whole, and q is a preset value.

Step 304) For each mesh ID in the acquired mesh ID set, the integral calculated in Step 303 is added to the record of keyword w for that mesh ID in the interest score DB 70.

Step 305) If an unprocessed distribution remains among the records for keyword w in the interest distribution DB 50, the process returns to Step 302; otherwise, it proceeds to Step 306.

Step 306) If an unprocessed keyword remains in the interest distribution DB 50, the process returns to Step 301; otherwise, the process ends.
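Steps 303 and 304 for one distribution can be sketched as follows. Several choices here are assumptions made for illustration: the q-percent region is taken to be the ellipse of constant Mahalanobis distance that contains q percent of the probability mass (for two dimensions its squared radius is -2 ln(1 - q/100)), a mesh is included when its center lies inside that ellipse, the integral over a mesh is approximated by a midpoint rule, and mesh IDs are (column, row) pairs. The description acquires the mixture ratio π in Step 302 but does not state how it enters the score, so weighting by π is left as an optional flag. All names are hypothetical.

```python
import math
from collections import defaultdict


def _det(cov):
    s11, s12, s22 = cov
    det = s11 * s22 - s12 * s12
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    return det


def mahalanobis_sq(x1, x2, mu, cov):
    """Squared Mahalanobis distance of (x1, x2) from mu under cov."""
    s11, s12, s22 = cov
    d1, d2 = x1 - mu[0], x2 - mu[1]
    return (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / _det(cov)


def cell_integral(ix, iy, mu, cov, size, n=8):
    """Midpoint-rule integral of the normal density over mesh (ix, iy)."""
    h = size / n
    norm = 2.0 * math.pi * math.sqrt(_det(cov))
    total = 0.0
    for a in range(n):
        for b in range(n):
            x1 = ix * size + (a + 0.5) * h
            x2 = iy * size + (b + 0.5) * h
            total += math.exp(-0.5 * mahalanobis_sq(x1, x2, mu, cov))
    return total * h * h / norm


def add_component_scores(score_db, w, comp, q=95.0, size=8.0,
                         use_mixture_ratio=False):
    """Steps 303-304 for one distribution of keyword w (sketch)."""
    mu, cov = comp["mu"], comp["cov"]
    t = -2.0 * math.log(1.0 - q / 100.0)   # chi-square quantile, 2 dof
    rx, ry = math.sqrt(t * cov[0]), math.sqrt(t * cov[2])  # ellipse extent
    weight = comp["pi"] if use_mixture_ratio else 1.0
    for ix in range(math.floor((mu[0] - rx) / size),
                    math.floor((mu[0] + rx) / size) + 1):
        for iy in range(math.floor((mu[1] - ry) / size),
                        math.floor((mu[1] + ry) / size) + 1):
            cx, cy = (ix + 0.5) * size, (iy + 0.5) * size
            if mahalanobis_sq(cx, cy, mu, cov) <= t:
                score_db[(w, (ix, iy))] += weight * cell_integral(
                    ix, iy, mu, cov, size)
```

The positive-definiteness check matters in practice: a parameter set whose σ₁₂² exceeds σ₁₁σ₂₂ does not describe a valid normal distribution and cannot be integrated.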

As described above, the click information for area tags (keywords) is converted into a two-dimensional histogram based on the viewing range, and an infinite Gaussian mixture estimation method on two-dimensional coordinates is applied (in practice the number of components is not infinite but converges to a number appropriate for the data). This makes it possible to estimate an interest distribution that fits the distribution of the data, and thus to generate the interest score DB for keywords.

The operation of each component of the estimated interest score DB generation apparatus shown in FIG. 5 can be built as a program, which can be installed and executed on a computer used as the estimated interest score DB generation apparatus, or distributed via a network.

The present invention is not limited to the above embodiment, and various modifications and applications are possible within the scope of the claims.

10 Keyword click log DB (database)
20 Click frequency histogram generation unit
30 Click frequency histogram DB (database)
40 Interest distribution calculation unit
50 Interest distribution DB (database)
60 Interest score DB generation unit
70 Interest score DB (database)
100 Interest score DB generation processing unit

Claims

An estimated interest score database generation apparatus for estimating the degree of interest of users in a document search service in which a geographic range is searched based on a map display range or latitude/longitude information, the apparatus comprising:
click frequency histogram storage means that stores, for each keyword, mesh IDs of rectangles obtained by dividing a map and a click frequency for each mesh ID;
interest distribution calculation means that acquires the mesh IDs and click frequencies of a keyword from the click frequency histogram storage means, estimates two-dimensional normal distributions to obtain two-dimensional normal distribution parameters for the keyword and the mixture ratio of each two-dimensional normal distribution for the keyword, and stores them in interest distribution storage means; and
interest score database generation means that, based on the two-dimensional normal distribution parameters acquired from the interest distribution storage means, calculates the integral of the probability density corresponding to each mesh ID and adds it to the record of the keyword corresponding to that mesh ID in an interest score database, thereby obtaining an interest distribution.

The estimated interest score database generation apparatus according to claim 1, further comprising:
keyword click log storage means that stores keywords and the geographic range in which each keyword was clicked; and
click frequency histogram generation means that acquires a keyword and its geographic range from the keyword click log storage means, calculates a score according to the size of the geographic range, obtains the set of mesh IDs corresponding to the geographic range, and, for each mesh ID in that set, adds the score to the click frequency column of the record in the click frequency histogram storage means whose keyword and mesh ID match.

The estimated interest score database generation apparatus according to claim 1, wherein the interest distribution calculation means includes means for converting the mesh IDs read from the click frequency histogram storage means into point information on two-dimensional coordinates, adding the point information to a data point set, obtaining two-dimensional normal distribution parameters in two-dimensional space based on the point set on two-dimensional coordinates formed by the data point set, obtaining the number of mixture components of the two-dimensional normal distributions and the mixture ratio of each distribution, and storing the two-dimensional normal distribution parameters for that number of components and the mixture ratios in the interest distribution storage means.

An estimated interest score database generation method for estimating the degree of interest of users in a document search service in which a geographic range is searched based on a map display range or latitude/longitude information, the method comprising:
an interest distribution calculation step in which interest distribution calculation means acquires the mesh IDs and click frequencies of a keyword from click frequency histogram storage means that stores, for each keyword, mesh IDs of rectangles obtained by dividing a map and a click frequency for each mesh ID, estimates two-dimensional normal distributions to obtain two-dimensional normal distribution parameters for the keyword and the mixture ratio of each two-dimensional normal distribution for the keyword, and stores them in interest distribution storage means; and
an interest score database generation step in which interest score database generation means, based on the two-dimensional normal distribution parameters acquired from the interest distribution storage means, calculates the integral of the probability density corresponding to each mesh ID and adds it to the record of the keyword corresponding to that mesh ID in an interest score database, thereby obtaining an interest distribution.

The estimated interest score database generation method according to claim 4, further comprising a click frequency histogram generation step in which click frequency histogram generation means acquires a keyword and its geographic range from keyword click log storage means that stores keywords and the geographic range in which each keyword was clicked, calculates a score according to the size of the geographic range, obtains the set of mesh IDs corresponding to the geographic range, and, for each mesh ID in that set, adds the score to the click frequency column of the record in the click frequency histogram storage means whose keyword and mesh ID match.

The estimated interest score database generation method according to claim 4, wherein, in the interest distribution calculation step, the mesh IDs read from the click frequency histogram storage means are converted into point information on two-dimensional coordinates, the point information is added to a data point set, two-dimensional normal distribution parameters in two-dimensional space are obtained based on the point set on two-dimensional coordinates formed by the data point set, the number of mixture components of the two-dimensional normal distributions and the mixture ratio of each distribution are obtained, and the two-dimensional normal distribution parameters for that number of components and the mixture ratios are stored in the interest distribution storage means.

An estimated interest score database generation program for causing a computer to function as each means of the estimated interest score database generation apparatus according to any one of claims 1 to 3.