JP5442543B2

JP5442543B2 - Content similarity calculation device and content similarity calculation method

Info

Publication number: JP5442543B2
Application number: JP2010151168A
Authority: JP
Inventors: 勇二森; 武史長沼; 靖近藤
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2010-07-01
Filing date: 2010-07-01
Publication date: 2014-03-12
Anticipated expiration: 2030-07-01
Also published as: JP2012014518A

Description

The present invention relates to an inter-content similarity calculation device and an inter-content similarity calculation method.

Devices that calculate relevance between contents using collaborative filtering, as employed on electronic commerce (EC) sites such as Amazon, are conventionally known as devices for finding content related to a given content. Devices that calculate relevance using content filtering, which compares information representing what the content is about, such as metadata, are also known (see Patent Document 1).

Patent Document 1: JP 2008-134941 A

Non-Patent Document 1: T. Joachims, "Optimizing Search Engines Using Clickthrough Data", Proc. ACM SIGKDD International Conf. Knowledge Discovery and Data Mining (KDD02), ACM Press, pp. 132-142, 2002.

In general, collaborative filtering has the drawback of requiring a large volume of user ratings for each content, or of browsing or purchase histories. Content filtering, for its part, rates only content with similar subject matter as highly relevant, so it tends to return items the user already knows within their area of interest and yields few new discoveries.

In content distribution systems built on open platforms such as Android and Windows Mobile, new content is added daily, and a method is needed for efficiently delivering, out of an enormous number of contents, the content a user potentially wants.

For mobile terminal content in particular, relevance between contents can arise not only from similarity of function but also from similarity of the situations in which the contents are used. For example, the contents "Weather Forecast" and "World Weather" can be considered similar in function. By contrast, the contents "Route Guidance" and "Gourmet Guide" can be considered similar in usage status, because a user may use "Route Guidance" in the same time slot and the same area in order to get to a restaurant found with "Gourmet Guide".

Accordingly, an object of the present invention is to calculate the similarity between contents while taking similarity of usage status into account, as a means of efficiently delivering the content a user potentially wants in a content distribution system.

The inter-content similarity calculation device of the present invention is
an inter-content similarity calculation device that calculates the similarity between contents, comprising:
a usage history receiving unit that receives, from a mobile terminal, a usage history recorded when content is activated on the mobile terminal;
a usage history totaling unit that aggregates the usage history received by the usage history receiving unit into preset usage statuses;
a usage status feature vectorization unit that generates a usage status feature vector for each content from the usage history aggregated by the usage history totaling unit; and
a similarity calculation unit that calculates the similarity between contents based on the usage status feature vectors.

The inter-content similarity calculation method of the present invention is
an inter-content similarity calculation method performed in an inter-content similarity calculation device that calculates the similarity between contents, comprising the steps of:
receiving, from a mobile terminal, a usage history recorded when content is activated on the mobile terminal;
aggregating the usage history received by the usage history receiving unit into preset usage statuses;
generating a usage status feature vector for each content from the usage history aggregated by the usage history totaling unit; and
calculating the similarity between contents based on the usage status feature vectors.

According to the embodiment of the present invention, the similarity between contents can be calculated while taking similarity of usage status into account.

FIG. 1 is a configuration diagram of a communication system according to the embodiment of the present invention.
FIG. 2 is a functional block diagram of the mobile terminal and the search server according to the embodiment of the present invention.
FIG. 3 shows the usage history management table in the mobile terminal according to the embodiment of the present invention.
FIG. 4 shows the usage history management table in the search server according to the embodiment of the present invention.
FIG. 5 shows the metadata management table in the search server according to the embodiment of the present invention.
FIG. 6 shows the feature quantity management table in the search server according to the embodiment of the present invention.
FIG. 7 shows the similarity management table in the search server according to the embodiment of the present invention.
FIG. 8 is a flowchart of the inter-content similarity calculation method in the search server according to the embodiment of the present invention.
FIG. 9 is a flowchart showing the procedure for converting usage status into a feature vector.

The embodiment of the present invention uses an inter-content similarity calculation device that calculates the similarity between contents. The device may be a search server capable of communicating with mobile terminals.

The inter-content similarity calculation device receives from a mobile terminal a usage history, such as time information or position information, recorded when content is activated on the terminal. The received usage history is aggregated into preset usage statuses, for example by area, by time slot of use, or by length of use. From the aggregated usage history, a usage status feature vector is generated for each content, and the similarity between contents is calculated based on these vectors. Here, the usage status feature vector is an n-dimensional vector representing the strength of association between each of n preset usage statuses and the content.

A metadata feature vector may also be generated, and the similarity between contents may be calculated based on both the usage status feature vector and the metadata feature vector. The metadata feature vector is an m-dimensional vector representing the strength of association between each of m feature words extracted from content metadata and the content.

The embodiment of the present invention is described in detail below.

<Configuration of communication system>
First, the overall configuration of the communication system according to this embodiment is described. As shown in FIG. 1, the communication system according to the embodiment of the present invention comprises a mobile terminal 10, a search server 20, a content management server 30, and a mobile communication network 40.

The search server 20 is an inter-content similarity calculation device that calculates the similarity between contents. The search server 20 accepts a content search request from the mobile terminal 10, searches for content highly relevant to the content named in the request, and sends the search result to the mobile terminal 10 via the mobile communication network 40.

The content management server 30 is a device that stores and manages the content delivered to the mobile terminal 10. Content here means a service or information provided from the content management server 30 to the mobile terminal 10, and includes, for example, applications, data, video, music, and combinations of these.

The mobile terminal 10 and the search server 20 are described in more detail with reference to FIG. 2. The mobile terminal 10 comprises a usage history transmission unit 101, a usage history storage unit 102, a usage history acquisition unit 103, a current location acquisition unit 104, and a search unit 105.

The usage history acquisition unit 103 acquires a usage history, such as time information, when content is executed on the mobile terminal 10. For example, it acquires the name of the executed content, the activation date and time, and the length of use.

The current location acquisition unit 104 acquires position information at the time the content is executed, for example the latitude and longitude of the mobile terminal 10.

The usage history storage unit 102 stores the information acquired by the usage history acquisition unit 103 and the current location acquisition unit 104.

The usage history transmission unit 101 transmits the usage history stored in the usage history storage unit 102, such as time information and position information, to the search server 20.

The search unit 105 sends a search request to the search server 20 and receives the search result from it. The search request is a content ID, and the search result is a list of content highly relevant to the content named in the request.

The search server 20 comprises a usage history receiving unit 201, an area information acquisition unit 202, a usage history totaling unit 203, a metadata storage unit 204, a feature word extraction unit 205, a metadata feature vectorization unit 206, a usage history storage unit 207, a usage status feature vectorization unit 208, a feature quantity storage unit 209, a similarity calculation unit 210, a similarity storage unit 211, and a search unit 212.

The usage history receiving unit 201 receives the usage history, such as time information and position information, from the mobile terminal 10.

The area information acquisition unit 202 uses the latitude and longitude of the mobile terminal 10 contained in the usage history to acquire information on nearby facilities such as stations and shopping centers. The nearby facility information may be obtained from a database (not shown) provided in the search server 20 that associates facility information with latitude and longitude, or from an external server.

The usage history totaling unit 203 aggregates, for each content, the information obtained from the usage history receiving unit 201 and the area information acquisition unit 202 by preset usage status.

The usage history storage unit 207 stores the results aggregated by the usage history totaling unit 203.

The usage status feature vectorization unit 208 generates a usage status feature vector that represents the usage status of each content. In this way, the representative usage status of each content is expressed as a feature vector.

The metadata storage unit 204 stores metadata for each content, including its name, category, and description.

The feature word extraction unit 205 extracts, from the description in the metadata, characteristic words that represent the content.

The metadata feature vectorization unit 206 generates, based on the extracted feature words, a metadata feature vector that represents the function of each content.

The feature quantity storage unit 209 stores the usage status feature vector and the metadata feature vector of each content, as calculated by the usage status feature vectorization unit 208 and the metadata feature vectorization unit 206.

The similarity calculation unit 210 calculates the usage status similarity and the metadata similarity based on the information stored in the feature quantity storage unit 209, and from these calculates the similarity between contents.

The similarity storage unit 211 stores the similarities calculated by the similarity calculation unit 210.

The search unit 212 accepts search requests from the mobile terminal 10 and sends a list of highly similar content to the mobile terminal 10.

In FIG. 2 the mobile terminal 10 transmits the usage history to the search server 20, but the usage history receiving unit 201 of the search server 20 may instead monitor the mobile terminal 10 and collect the usage history.

FIG. 3 shows the usage history management table of the usage history storage unit 102. For each usage history record of content executed on the mobile terminal 10, the usage history storage unit 102 stores the content ID that uniquely identifies the content, the activation date and time, the latitude, the longitude, and the length of use.

FIG. 4 shows the usage history management table of the usage history storage unit 207. For each content ID, the usage history storage unit 207 stores the usage statuses in which the content was activated, together with the number of activations that fall under each usage status. A usage status is expressed as a usage status ID. For example, the components of the usage status ID (S01 to S06) represent the time slot in which the content was used (morning, noon, evening), the length of use (within 5 minutes, 30 minutes or more, and so on), area information (station, restaurant, school, movie theater, and so on), and the like. Each component holds 0 (not applicable) or 1 (applicable). For example, when content ID = C0001 is used at noon (S02) at a station (S06), the activation count of the usage status (S01,S02,S03,S04,S05,S06) = (0,1,0,0,0,1) is incremented by one. The usage history of each content is sorted in descending order of activation count (or download count), and a record whose components are all 0 is inserted at the bottom of each content's sorted usage history. The usage history may be aggregated based on either time information (time slot or length of use) or position information (area information), or on both.
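The aggregation described for FIG. 4 can be sketched as follows. This is a minimal illustration, not the implementation in the patent: the function name `aggregate`, the record layout, and the count of 0 stored on the all-zero terminator record are assumptions made for the example.

```python
from collections import Counter, defaultdict

N_COMPONENTS = 6  # S01..S06, as in the FIG. 4 example


def aggregate(records):
    """Count activations per (content ID, usage status ID).

    records: iterable of (content_id, status) pairs, where status is a
    tuple of 0/1 flags. Returns {content_id: [(status, count), ...]} with
    rows sorted by count in descending order and an all-zero record
    appended at the bottom (its count of 0 is an assumption).
    """
    counts = defaultdict(Counter)
    for content_id, status in records:
        counts[content_id][tuple(status)] += 1

    table = {}
    for content_id, counter in counts.items():
        rows = sorted(counter.items(), key=lambda item: item[1], reverse=True)
        rows.append(((0,) * N_COMPONENTS, 0))
        table[content_id] = rows
    return table


# Hypothetical history: C0001 used twice at noon at a station, once otherwise.
records = [
    ("C0001", (0, 1, 0, 0, 0, 1)),
    ("C0001", (1, 0, 0, 1, 0, 0)),
    ("C0001", (0, 1, 0, 0, 0, 1)),
]
table = aggregate(records)
```

Keeping one counter per content ID mirrors the per-content sorting the text describes, so the most frequent usage status always appears first in each content's rows.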

FIG. 5 shows the metadata management table of the metadata storage unit 204. The metadata storage unit 204 stores content metadata by content ID. Content may be classified by category.

FIG. 6 shows the feature quantity management table of the feature quantity storage unit 209. Where the number of extracted feature words is m and the number of elements in the usage status ID is n, the feature quantity storage unit 209 stores the m-dimensional vector calculated by the metadata feature vectorization unit 206 (the metadata feature vector), the n-dimensional vector calculated by the usage status feature vectorization unit 208 (the usage status feature vector), and a weight coefficient α that determines the weighting of the two feature vectors when the similarity between contents is measured. For example, the metadata feature vector represents whether each of the m feature words, such as "route", "station", "izakaya", "weather", and "time", appears, the probability that a feature word appears, the degree of co-occurrence between feature words, and so on. The usage status feature vector represents the usage status of the content as transformed by an SVM (Support Vector Machine) or the like.

FIG. 7 shows the similarity management table of the similarity storage unit 211. For each combination of content IDs (content ID1, content ID2), the similarity storage unit 211 stores the cosine similarity of the metadata feature vectors (metadata similarity), the cosine similarity of the usage status feature vectors (usage status similarity), the inter-content similarity calculated by weighting the metadata similarity and the usage status similarity according to α, and the number of download requests, which indicates how many times the content with content ID2 was actually downloaded following a search request that used the content with content ID1 as the search key.

<Method for calculating similarity between contents>
Next, the inter-content similarity calculation method in the search server is described. FIG. 8 is a flowchart of the inter-content similarity calculation method in the search server 20 according to the embodiment of the present invention.

First, when new content has been added to the metadata storage unit 204 in the search server 20 (step S11), metadata feature vectors are generated as advance preparation (step S18). Specifically, the feature word extraction unit 205 extracts words such as nouns from the metadata of each content by morphological analysis. From the extracted nouns, thresholding on TFIDF (term frequency inverse document frequency) is used to identify and exclude general nouns that have little to do with the characteristics of the content and could act as noise. For example, when the stored metadata is as shown in FIG. 5, words such as "route", "station", "izakaya", "weather", and "time" are extracted as feature words, while general words such as "search" and "display" are expected to be excluded. Using the m feature words extracted here, the metadata of each content is converted into an m-dimensional vector. Each component of the vector can be determined either by assigning 1 (appears) or 0 (does not appear) according to whether the feature word appears, or by assigning a value weighted by the appearance probability of the feature word or the degree of co-occurrence between feature words. The calculated m-dimensional vector is stored in the metadata feature vector column of the feature quantity management table of the feature quantity storage unit 209 shown in FIG. 6.
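The feature word selection and binary vectorization above might look like the following sketch. It assumes the descriptions have already been tokenized (morphological analysis is not shown), and the TF-IDF variant, the threshold value, the token lists, and the names `select_feature_words` and `binary_vector` are all hypothetical choices for illustration; the text does not specify them.

```python
import math
from collections import Counter


def select_feature_words(docs, threshold):
    """Keep words whose best TF-IDF score over all documents exceeds threshold.

    TF is the relative frequency within a document and IDF is
    log(N / document frequency); both are assumed variants.
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    best = {}
    for doc in docs:
        for word, count in Counter(doc).items():
            score = (count / len(doc)) * math.log(n_docs / doc_freq[word])
            best[word] = max(best.get(word, 0.0), score)
    return sorted(word for word, score in best.items() if score > threshold)


def binary_vector(doc, feature_words):
    """1 if the feature word appears in the document, otherwise 0."""
    present = set(doc)
    return [1 if word in present else 0 for word in feature_words]


# Hypothetical tokenized descriptions; "search" and "display" occur everywhere.
descriptions = {
    "C0001": ["route", "station", "time", "search", "display"],
    "C0002": ["izakaya", "station", "search", "display"],
    "C0003": ["weather", "search", "display"],
}
feature_words = select_feature_words(list(descriptions.values()), threshold=0.05)
vector_c0001 = binary_vector(descriptions["C0001"], feature_words)
```

Words that occur in every description get an IDF of zero under this variant, which is why "search" and "display" drop out of the toy vocabulary.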

The similarity calculation unit 210 then calculates, for every combination of contents, the cosine distance simA between the m-dimensional vectors generated from the metadata (step S19). For example, the cosine distance between a vector v_i and a vector v_j is calculated by equation (1).

The calculated cosine distance simA of the metadata feature vectors (the metadata similarity) is stored in the metadata similarity column of the similarity management table of the similarity storage unit 211 shown in FIG. 7.
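Equation (1) is not reproduced in this text. Assuming it is the standard cosine measure, sim(v_i, v_j) = (v_i · v_j) / (|v_i| |v_j|), a minimal sketch is given below. The function name and the choice to return 0.0 for an all-zero vector are assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector is all zeros (an assumed convention,
    since the cosine is undefined in that case).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


sim_same = cosine_similarity([1, 0, 1], [1, 0, 1])
sim_orthogonal = cosine_similarity([1, 0], [0, 1])
sim_partial = cosine_similarity([1, 1, 0], [1, 0, 0])
```

The same function would apply unchanged to the n-dimensional usage status feature vectors, since only the vector length differs.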

Meanwhile, in the mobile terminal 10, the usage history acquisition unit 103 monitors the activation state of content, detects content activation and termination events, and records the ID of the activated content, the activation date and time, the length of use, and the latitude and longitude of the activation location. Positioning can be performed either with a GPS module built into the mobile terminal or by obtaining position information from a base station. The content usage history stored in the usage history storage unit 102 as shown in FIG. 3 is transmitted to the usage history receiving unit 201 of the search server 20 at a predetermined timing (step S12).

Next, for the usage history received by the usage history receiving unit 201, the usage history totaling unit 203 has the area information acquisition unit 202 perform a lookup based on the latitude and longitude, and obtains information on the facilities around the location where the content was activated. It then aggregates the content usage history by usage status ID, which consists of n preset elements such as time slot, length of use, and area information (step S13). Each element of the usage status ID is set to 1 or 0 according to whether or not it applies. For example, suppose there are 11 elements: time slot (morning, noon, night), length of use (within 5 minutes, 5 to 29 minutes, 30 to 59 minutes, 60 minutes or more), and area information (station, restaurant, movie theater, convenience store). A record of content used for 3 minutes in the morning at a convenience store inside a station is then counted under usage status ID (1,0,0,1,0,0,0,1,0,0,1). In this way, every history record received from the mobile terminal 10 is assigned a usage status ID, and the activation count for the matching usage status ID is incremented by one.
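The 11-element encoding in this example can be reproduced with the short sketch below. The element order follows the text; the function names and the decision to place a use of exactly 5 minutes in the first bin are assumptions, since the text's bins overlap at 5 minutes.

```python
TIME_SLOTS = ("morning", "noon", "night")
AREAS = ("station", "restaurant", "movie theater", "convenience store")


def duration_bin(minutes):
    """Index of the length-of-use bin (exactly 5 minutes goes to bin 0 by assumption)."""
    if minutes <= 5:
        return 0
    if minutes < 30:
        return 1
    if minutes < 60:
        return 2
    return 3


def usage_status_id(time_slot, minutes, areas):
    """Build the 11-component 0/1 usage status ID: 3 time slots, 4 duration bins, 4 areas."""
    slot_part = [1 if slot == time_slot else 0 for slot in TIME_SLOTS]
    duration_part = [0, 0, 0, 0]
    duration_part[duration_bin(minutes)] = 1
    area_part = [1 if area in areas else 0 for area in AREAS]
    return tuple(slot_part + duration_part + area_part)


# Morning, 3 minutes, at a convenience store inside a station.
status = usage_status_id("morning", 3, {"station", "convenience store"})
print(status)  # (1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1)
```

Area flags are not mutually exclusive here, which is what lets a single record mark both "station" and "convenience store".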

The aggregated usage history is converted into a feature vector (step S14). The specific procedure is shown in FIG. 9. As shown in FIG. 4, the aggregated usage history is sorted for each content ID in descending order of activation count and stored in the usage history storage unit 207 (step S20). Next, a record whose components are all 0 is inserted at the bottom of each content's records (step S21). The sorted usage history is converted by an SVM (Support Vector Machine) into a single usage status feature vector representing each content (see Non-Patent Document 1) and stored in the feature quantity storage unit 209 (step S22). Sorting by activation count causes the usage statuses with many activations to be reflected more strongly. Because the usage status feature vector depends on the activation counts, it may be recalculated dynamically based on the collected usage history.

The similarity calculation unit 210 then calculates, for every combination of contents, the cosine distance simB between the n-dimensional vectors generated from the usage statuses (step S15). This cosine distance simB is calculated in the same way as in equation (1). The calculated cosine distance simB of the usage status feature vectors (the usage status similarity) is stored in the usage status similarity column of the similarity management table of the similarity storage unit 211 shown in FIG. 7.

As shown in FIG. 7, the similarity storage unit 211 stores the number of times the content with content ID2 was downloaded after a related-content search that used the content with content ID1 as the search key (the number of download requests). The weight coefficient α used in calculating the inter-content similarity is calculated from this number of download requests by equation (2) (step S16).

Here, request(i) is the total number of download requests for other contents made after search requests that used content i as the search key, and request(i|simA>simB) is the number of those download requests that fall on combinations satisfying simA>simB. The initial value of α_i is 0.5, which means that simA and simB are treated equally while the number of download requests is below a threshold. In this way, α_i is held at 0.5 until the total number of requests reaches the threshold, so that a single request does not change α_i greatly. The initial value of α_i may be a value other than 0.5.

In the example of content C0001 shown in FIG. 7, contents C0002, C0003, and C0004 were downloaded 50, 20, and 30 times, respectively. Of the 100 download requests in total that followed related-content searches for C0001, 20 fall on combinations with simA>simB, so α = 0.2, and this value is stored as the weight coefficient α in the feature quantity storage unit 209. α is calculated for each content; the closer it is to 0, the more the search results favor content with a similar usage status. In this way, α may be changed dynamically according to the number of download requests.
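Equation (2) is not reproduced in this text. Based on the definitions and the worked example above, it presumably takes the form α_i = request(i|simA>simB) / request(i) once request(i) reaches the threshold, and α_i = 0.5 otherwise. The sketch below follows that reading; the threshold of 10 and the simA and simB values are hypothetical, chosen only so that C0003 is the one combination with simA > simB.

```python
def weight_alpha(downloads, sim_a, sim_b, threshold=10, initial=0.5):
    """Share of download requests that went to contents with simA > simB.

    downloads: {content_id: download count following searches keyed on content i}
    sim_a, sim_b: {content_id: similarity to content i}
    The threshold value is an assumption; the text gives no number.
    """
    total = sum(downloads.values())
    if total < threshold:
        return initial
    metadata_led = sum(
        count for cid, count in downloads.items() if sim_a[cid] > sim_b[cid]
    )
    return metadata_led / total


# Download counts from the C0001 example; similarity values are hypothetical.
downloads = {"C0002": 50, "C0003": 20, "C0004": 30}
sim_a = {"C0002": 0.70, "C0003": 0.60, "C0004": 0.50}
sim_b = {"C0002": 0.85, "C0003": 0.40, "C0004": 0.75}
alpha = weight_alpha(downloads, sim_a, sim_b)
```

With these inputs the function reproduces the α = 0.2 of the example, and a content with only a handful of requests keeps the initial value.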

次に、コンテンツの全組み合わせに対し、類似度算出部２１０がsimA、simB、αからコンテンツiとコンテンツjとの間のコンテンツ間類似度sim_i,jを、以下の式（３）により算出する。 Next, for all combinations of content, the similarity calculation unit 210 calculates the inter-content similarity sim _{i, j} between the content i and the content j from simA, simB, α using the following equation (3). .

The calculated inter-content similarity is stored in the similarity storage unit 211. In FIG. 7, the inter-content similarities for C0001 are the values calculated with α = 0.5. When α is updated to 0.2 according to the number of download requests as described above, the similarities from C0001 to C0002, C0003, and C0004 become 0.82, 0.46, and 0.68, respectively, and the association with C0002 and C0004 becomes stronger. That is, related-content searches from C0001 come to depend more on usage-status similarity.

When the search unit 212 of the search server 20 receives a request from the mobile terminal 10, it delivers a content list in which the content is sorted in descending order of the similarity stored in the similarity storage unit 211 for the received content ID.
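The list generation performed by the search unit can be sketched as a sort over stored similarities. The nested-dictionary layout of the similarity store and the function name `related_content_list` are assumptions for illustration; the similarity values are those given above for C0001 after α is updated to 0.2.

```python
from typing import Dict, List

# Hypothetical layout of the similarity storage: search key -> {other content: similarity}.
SimilarityStore = Dict[str, Dict[str, float]]


def related_content_list(content_id: str, store: SimilarityStore) -> List[str]:
    """Return content IDs sorted in descending order of stored similarity."""
    scores = store.get(content_id, {})
    return sorted(scores, key=lambda other: scores[other], reverse=True)


similarity_store: SimilarityStore = {
    "C0001": {"C0002": 0.82, "C0003": 0.46, "C0004": 0.68},
}
ranked = related_content_list("C0001", similarity_store)
```

For a content ID that has no stored similarities, this sketch returns an empty list; how the actual device handles that case is not stated in the text.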

<Effects of the Embodiment>
According to the embodiment of the present invention, the similarity between content items can be calculated taking the similarity of usage status into account. For example, for a favorite content item explicitly specified by the user, content that is used in similar usage situations, and not only content with similar functions, can be presented, offering new discoveries. In addition, users can reach content that suits their latent purposes in fewer steps, which reduces the use of resources such as client terminals and network equipment. Furthermore, because the search results reflect, for each content item, whether related content based on metadata or on usage status is preferred, the appearance of inappropriate content is suppressed.

For example, content items that differ in function but tend to be used in similar situations, such as a news reader and a weather forecast, or a media player and a book reader, become associated with each other. This enables searches that suit the purpose of looking for useful content without a clear idea of what that content is.

Furthermore, because both the usage status and the metadata are converted into feature vectors, the user only needs to notify the server of a known content item as a search key, and the server can deliver a list of content whose typical usage status and metadata are similar to those of that content item.

Furthermore, by changing α dynamically, the weight of the usage-status similarity can be increased when content with high usage-status similarity to the search key is downloaded, and the weight of the metadata similarity can be increased when content with high metadata similarity is downloaded. This makes it possible to handle both content characterized by its usage status and content characterized by its function.

Furthermore, by sorting the usage history in descending order of activation count and generating the usage status feature vector with an SVM, a typical usage status that strongly reflects the usage statuses with high activation counts can be calculated for each content item.
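The sorting step mentioned above can be sketched as follows. This covers only counting activations per usage status and ordering them; the subsequent SVM-based generation of the feature vector is not shown. The usage-status labels and the function name `rank_usage_statuses` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_usage_statuses(usage_history: Iterable[str]) -> List[Tuple[str, int]]:
    """Count activations per usage status and sort by count, descending.

    Each element of `usage_history` is the (hypothetical) label of the preset
    usage status under which one activation of the content was recorded.
    """
    counts = Counter(usage_history)
    # most_common() returns (label, count) pairs from most to least frequent.
    return counts.most_common()


# Hypothetical activation records for a single content item.
history = ["morning", "evening", "morning", "night", "morning", "evening"]
ranking = rank_usage_statuses(history)
```

The resulting order places the most frequently observed usage status first, which is the ordering the SVM step is described as taking for its input.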

For convenience of explanation, the inter-content similarity calculation device according to the embodiment of the present invention has been described using functional block diagrams, but the device may be implemented in hardware, software, or a combination thereof. For example, each functional unit of the inter-content similarity calculation device may be implemented in software and realized on a computer. In addition, two or more embodiments, and the individual components of the embodiments, may be used in combination as necessary.

Although an embodiment of the present invention has been described above, the present invention is not limited to the above embodiment, and various modifications and applications are possible within the scope of the claims.

Description of Symbols
10 Mobile terminal
101 Usage history transmission unit
102 Usage history storage unit
103 Usage history acquisition unit
104 Current location acquisition unit
105 Search unit
20 Search server
201 Usage history receiving unit
202 Area information acquisition unit
203 Usage history totaling unit
204 Metadata storage unit
205 Feature word extraction unit
206 Metadata feature vectorization unit
207 Usage history storage unit
208 Usage status feature vectorization unit
209 Feature amount storage unit
210 Similarity calculation unit
211 Similarity storage unit
212 Search unit

Claims

An inter-content similarity calculation device that calculates the similarity between content items, comprising:
a usage history receiving unit that receives a usage history from a mobile terminal when content is activated on the mobile terminal;
a usage history totaling unit that totals the usage history received by the usage history receiving unit for each preset usage status;
a usage status feature vectorization unit that generates a usage status feature vector for each content item from the usage history totaled by the usage history totaling unit; and
a similarity calculation unit that calculates the similarity between content items based on the usage status feature vectors.

The inter-content similarity calculation device according to claim 1, further comprising a metadata feature vectorization unit that generates a metadata feature vector for each content item based on words included in the metadata of the content,
wherein the similarity calculation unit calculates a usage status similarity from the usage status feature vectors, calculates a metadata similarity from the metadata feature vectors, and calculates the similarity between content items by weighting the usage status similarity and the metadata similarity.

The inter-content similarity calculation device according to claim 2, wherein the similarity calculation unit calculates the weighting coefficient used to weight the usage status similarity and the metadata similarity based on the number of times content has been downloaded on the basis of the similarity between content items.

The inter-content similarity calculation device according to any one of claims 1 to 3, wherein the usage status feature vectorization unit generates the feature vector by sorting the usage statuses in descending order of activation count for each content item.

The inter-content similarity calculation device according to claim 1, wherein the usage history totaling unit totals the usage history received by the usage history receiving unit for each usage time zone, or for each length of usage time, in which the content was activated on the mobile terminal.

The inter-content similarity calculation device according to any one of claims 1 to 5, further comprising an area information acquisition unit that acquires, from the usage history received by the usage history receiving unit, the area in which the content was activated on the mobile terminal,
wherein the usage history totaling unit totals the usage history received by the usage history receiving unit for each area.

The inter-content similarity calculation device according to any one of claims 1 to 6, further comprising a content search unit that, when a content search request is received from a mobile terminal, generates a list in which content is sorted in descending order of similarity.

An inter-content similarity calculation method performed by an inter-content similarity calculation device that calculates the similarity between content items, the method comprising:
receiving a usage history from a mobile terminal when content is activated on the mobile terminal;
totaling the received usage history for each preset usage status;
generating a usage status feature vector for each content item from the totaled usage history; and
calculating the similarity between content items based on the usage status feature vectors.