JP2011529600A

JP2011529600A - Method and apparatus for relating datasets by using semantic vector and keyword analysis

Info

Publication number: JP2011529600A
Application number: JP2011521074A
Authority: JP
Inventors: リュアン，ウエン; マハ，クリント・プレンティス; ヒーリー，ジェラルド・フランシス，ザ・サード; ファリス，アンドリュー・ロレンス; スタインバーグ，ガブリエル
Original assignee: TEXTWISE LLC
Current assignee: TEXTWISE LLC
Priority date: 2008-07-29
Filing date: 2008-07-29
Publication date: 2011-12-08
Also published as: EP2307951A4; EP2307951A1; CN101802776A; WO2010014082A1

Abstract

The present disclosure describes systems and methods for identifying one or more data sets, such as advertisements, that are contextually related to a target data set, such as a web page being reviewed by a user. The identification is based on an analysis of unique semantic vectors, such as trainable semantic vectors (TSVs), that represent the web page and the advertisements, and of semantic representations that include information on representative keywords of the advertisements and the web page.

Description

Field of the Disclosure
The present disclosure relates to methods and systems for identifying contextually related data sets, such as documents, web pages, emails, search queries and advertisements. More particularly, it relates to methods and systems for identifying data sets that are contextually related to a target data set by analyzing unique semantic vectors of the data sets and keyword semantic representations that include information on representative keywords of the data sets.

Background and Summary of the Disclosure
Search engines and advertisement placement systems, such as those developed by Microsoft Corporation, Google Inc., Vibrant Media or Yahoo! Inc., are widely used to identify documents or files that are potentially relevant to a search query entered by a user, or to select and display advertisements that are contextually related to one or more data sets, such as documents, email messages, RSS feeds or web pages, that a user has viewed or manipulated or is viewing or manipulating.

However, although existing search engines and advertisement placement systems have been developed and refined over several years, they remain far from satisfactory. Search results or identified queries often lack sufficient relevance to the search query entered by the user, or to the document or web page that the user is viewing or has viewed.

The present disclosure describes various embodiments that efficiently identify one or more data sets, such as documents, web pages or emails, that may be contextually related to a target data set, such as a search query or a web page being viewed by a user. The identification is made by analyzing unique semantic vectors representing the data sets and semantic representations that include information on representative keywords of the data sets.

An exemplary method according to the present disclosure controls a data processing system to relate at least one data set from a group of data sets to a target data set. Each data set in the group, and the target data set, includes at least one keyword. The method accesses a semantic vector representing the target data set and a respective semantic vector representing each data set in the group. Each semantic vector representing a data set in the group includes collective information on the relationships between each keyword in that data set and the predetermined categories to which the keyword may relate. Likewise, the semantic vector representing the target data set includes collective information on the relationships between each keyword in the target data set and the predetermined categories to which the keyword may relate. Every semantic vector, whether it represents the target data set or a data set in the group, has a dimension equal to the number of predetermined categories. For each data set in the group, the method determines a first similarity between the target data set and that data set by comparing the semantic vector associated with the target data set with the semantic vector associated with that data set. The exemplary method further accesses a keyword semantic representation of the target data set and a keyword semantic representation of each data set in the group. Each keyword semantic representation includes information representing the representative keywords of the corresponding data set, and is constructed differently from the semantic vector of that data set. For each data set in the group, the method determines a second similarity between the target data set and that data set by comparing the keyword semantic representation of the target data set with the keyword semantic representation of that data set. At least one of the data sets in the group is selected according to the first similarity and the second similarity between the target data set and each data set in the group. The method then relates the at least one selected data set to the target data set. At least one of the data sets may be presented to the user at the same time as the target data set, or after the target data set has been presented to the user. At least one of the data sets, or the target data set, may be presented to the user in auditory form, visual form, video form, tactile form, or any combination thereof.
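The two similarities described above can be illustrated with a minimal sketch. The disclosure does not name specific similarity measures at this point, so cosine similarity (for the semantic vectors) and Jaccard overlap (for the representative keywords) are assumptions chosen only for illustration, as are the function names and the sample data.

```python
import math

def cosine_similarity(u, v):
    """First similarity (assumed measure): cosine of two semantic vectors."""
    if len(u) != len(v):
        raise ValueError("semantic vectors must have one entry per category")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def keyword_overlap(target_keywords, candidate_keywords):
    """Second similarity (assumed measure): Jaccard overlap of keyword sets."""
    a, b = set(target_keywords), set(candidate_keywords)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def score_group(target, group):
    """Return (name, first_similarity, second_similarity) per data set."""
    return [
        (d["name"],
         cosine_similarity(target["vector"], d["vector"]),
         keyword_overlap(target["keywords"], d["keywords"]))
        for d in group
    ]

# Hypothetical data: three predetermined categories, one page, two ads.
page = {"name": "page", "vector": [1.0, 0.0, 1.0],
        "keywords": {"hiking", "boots", "trail"}}
ads = [
    {"name": "ad_boots", "vector": [1.0, 0.0, 1.0],
     "keywords": {"boots", "trail", "sale"}},
    {"name": "ad_loans", "vector": [0.0, 1.0, 0.0],
     "keywords": {"loan", "rates"}},
]
scores = score_group(page, ads)
```

Keeping the two scores separate, rather than merging them immediately, leaves room for any of the selection strategies that the disclosure goes on to describe.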

In one embodiment, at least one of the data sets in the group is an advertisement, and the target data set is a document, a web page, an email, an RSS news feed, a data stream, broadcast data or information about a user, or a portion of one or more of these. According to yet another embodiment, the exemplary method may convey to the user the at least one selected data set, or a file associated with the data set selected for the target data set, or a file associated with the target data set. The at least one selected data set may be conveyed to the user by displaying it, by playing an audio signal according to it, or by providing a link to it.

In one embodiment, the at least one keyword may include at least one of a word, a phrase, a character string, a pre-assigned keyword, a sub-data set, meta information, and information retrieved based on a link included in the respective data set. In another embodiment, the semantic vector for each data set is pre-computed and included in the respective data set. Semantic vectors may also be generated dynamically, on the fly.

According to one embodiment, the semantic vector representing each data set in the group is constructed based on the at least one keyword of that data set and on known relationships between known keywords and the predetermined categories to which the known keywords may relate. The semantic vector representing the target data set is constructed based on the at least one keyword of the target data set and on the same kind of known relationships between known keywords and predetermined categories. According to another embodiment, the semantic vector associated with a data set is further generated based on information about at least one user, or on at least one data set linked to that data set. The information about the at least one user may include at least one of previously viewed documents, previous search requests, user preferences and personal information.

According to one embodiment, the step of selecting at least one of the data sets in the group according to the first similarity and the second similarity between the target data set and each data set in the group includes: designating one of the first similarity and the second similarity as a primary similarity and the other as a secondary similarity; accessing information on a plurality of preset relevance levels for the primary similarity; for each data set in the group, mapping its primary similarity to one of the preset relevance levels; ranking the data sets in the group according to the preset relevance level to which each is mapped; within each relevance level, ranking the data sets according to their secondary similarity; and selecting at least one of the data sets in the group according to the result of the ranking within each relevance level.
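The level-based selection described above can be sketched as follows. The threshold values, the tuple layout `(name, primary, secondary)` and the function name are hypothetical; the disclosure only requires that primary similarities be mapped to preset relevance levels and that the secondary similarity order the data sets within each level.

```python
def rank_by_levels(scored, thresholds):
    """Rank (name, primary, secondary) tuples by relevance level, then secondary.

    thresholds: descending lower bounds, one per preset relevance level.
    Level 0 is the most relevant; anything below every bound lands in
    the last level, len(thresholds).
    """
    def level(primary):
        for i, bound in enumerate(thresholds):
            if primary >= bound:
                return i
        return len(thresholds)

    # Sort by level first, then by descending secondary similarity.
    return sorted(scored, key=lambda s: (level(s[1]), -s[2]))

# Hypothetical scores for four advertisements.
scored = [
    ("ad_a", 0.82, 0.10),
    ("ad_b", 0.78, 0.60),
    ("ad_c", 0.95, 0.05),
    ("ad_d", 0.40, 0.90),
]
ranked = rank_by_levels(scored, thresholds=[0.9, 0.7, 0.5])
selected = [name for name, _, _ in ranked[:2]]
```

In this sketch `ad_a` and `ad_b` fall into the same level, so the secondary similarity decides their order even though `ad_a` has the slightly higher primary similarity. Bucketing in this way keeps small differences in the primary score from dominating the result.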

According to another embodiment, the step of selecting at least one of the data sets in the group according to the first similarity and the second similarity between the target data set and each data set in the group includes: designating one of the first similarity and the second similarity as a primary similarity and the other as a secondary similarity; ranking the data sets in the group according to the primary similarity; selecting at least one candidate data set from the ranked data sets according to preset criteria; ranking the at least one candidate data set according to the secondary similarity; and selecting at least one of the data sets in the group according to the result of ranking the at least one candidate data set.
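A minimal sketch of this two-stage selection is given below. The preset criterion is assumed here to be "keep the top N by primary similarity"; the disclosure does not specify the criterion, and the function name, parameters and sample scores are hypothetical.

```python
def two_stage_select(scored, candidate_count, final_count):
    """Select data sets from (name, primary, secondary) tuples in two stages."""
    # Stage 1: rank every data set in the group by primary similarity.
    by_primary = sorted(scored, key=lambda s: s[1], reverse=True)
    # Assumed preset criterion: keep only the top candidate_count entries.
    candidates = by_primary[:candidate_count]
    # Stage 2: re-rank the surviving candidates by secondary similarity.
    by_secondary = sorted(candidates, key=lambda s: s[2], reverse=True)
    return [name for name, _, _ in by_secondary[:final_count]]

# Hypothetical scores for four advertisements.
scored = [
    ("ad_a", 0.82, 0.10),
    ("ad_b", 0.78, 0.60),
    ("ad_c", 0.95, 0.05),
    ("ad_d", 0.40, 0.90),
]
chosen = two_stage_select(scored, candidate_count=3, final_count=2)
```

Here `ad_d` has the highest secondary similarity but never reaches the second stage, because its primary similarity is too low to make the candidate cut.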

According to yet another embodiment, the step of selecting at least one of the data sets in the group according to the first similarity and the second similarity between the target data set and each data set in the group includes: for each data set in the group, calculating a composite similarity according to a preset formula, based on the first similarity and the second similarity of that data set; and selecting at least one of the data sets in the group according to the composite similarity of each data set.
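As a sketch of the composite approach, the example below assumes the preset formula is a weighted sum of the two similarities. The disclosure does not state the formula, so the weighted sum, the `weight` parameter and the sample values are illustrative assumptions only.

```python
def composite_similarity(first, second, weight=0.5):
    """Assumed preset formula: weighted sum of first and second similarity."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * first + (1.0 - weight) * second

def select_best(scored, weight=0.5):
    """Return the name with the highest composite similarity."""
    return max(scored, key=lambda s: composite_similarity(s[1], s[2], weight))[0]

# Hypothetical (name, first_similarity, second_similarity) tuples.
scored = [("ad_a", 0.8, 0.2), ("ad_b", 0.6, 0.9)]
best_balanced = select_best(scored, weight=0.5)
best_vector_only = select_best(scored, weight=1.0)
```

Setting `weight` to 1.0 or 0.0 reduces the sketch to ranking by a single similarity, which makes it easy to see how much each signal changes the outcome.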

Another aspect of the invention is an exemplary data processing system for relating at least one data set from a group of data sets to a target data set. Each data set in the group, and the target data set, includes at least one keyword. The system includes a data processor configured to process data, and a data storage system configured to store instructions that, when executed by the data processor, control the data processor to perform prescribed steps. The steps include: accessing a semantic vector representing the target data set and a respective semantic vector representing each data set in the group, wherein each semantic vector representing a data set in the group includes collective information on the relationships between each keyword in that data set and the predetermined categories to which the keyword may relate, the semantic vector representing the target data set includes collective information on the relationships between each keyword in the target data set and the predetermined categories to which the keyword may relate, and each semantic vector has a dimension equal to the number of predetermined categories; for each data set in the group, determining a first similarity between the target data set and that data set by comparing the semantic vector associated with the target data set with the semantic vector associated with that data set; accessing a keyword semantic representation of the target data set and a keyword semantic representation of each data set in the group, wherein each keyword semantic representation includes information representing the representative keywords of the corresponding data set and is constructed differently from the semantic vector of that data set; for each data set in the group, determining a second similarity between the target data set and that data set by comparing the keyword semantic representation of the target data set with the keyword semantic representation of that data set; selecting at least one of the data sets in the group according to the first similarity and the second similarity between the target data set and each data set in the group; and relating the at least one selected data set in the group to the target data set.

The exemplary systems described herein may be implemented using one or more computer systems and/or appropriate software.

Embodiments herein include a machine-readable medium carrying instructions that, when executed by a data processing system, control the data processing system to perform machine-implemented steps to relate at least one data set from a group of data sets to a target data set. Each data set in the group, and the target data set, includes at least one keyword. The steps include: accessing a semantic vector representing the target data set and a respective semantic vector representing each data set in the group, wherein each semantic vector representing a data set in the group includes collective information on the relationships between each keyword in that data set and the predetermined categories to which the keyword may relate, the semantic vector representing the target data set includes collective information on the relationships between each keyword in the target data set and the predetermined categories to which the keyword may relate, and each semantic vector has a dimension equal to the number of predetermined categories; for each data set in the group, determining a first similarity between the target data set and that data set by comparing the semantic vector associated with the target data set with the semantic vector associated with that data set; accessing a keyword semantic representation of the target data set and a keyword semantic representation of each data set in the group, wherein each keyword semantic representation includes information representing the representative keywords of the corresponding data set and is constructed differently from the semantic vector of that data set; for each data set in the group, determining a second similarity between the target data set and that data set by comparing the keyword semantic representation of the target data set with the keyword semantic representation of that data set; selecting at least one of the data sets in the group according to the first similarity and the second similarity between the target data set and each data set in the group; and relating the at least one selected data set in the group to the target data set.

Additional advantages and novel features of the present disclosure will be set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the following, or may be learned by practice of the disclosure. The embodiments shown and described provide an illustration of the best mode contemplated for carrying out the disclosure. Each feature and embodiment described herein may be practiced alone or in combination with other features or embodiments. The disclosure is capable of modifications in various obvious respects, all without departing from its spirit and scope. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive. The advantages of the disclosure may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.

FIG. 1 is a block diagram of an exemplary advertisement placement system.
FIG. 2 shows an embodiment of an exemplary advertisement placement system according to the present disclosure.
FIG. 3 illustrates the operation of another embodiment of an advertisement placement system according to the present disclosure.
FIG. 4 is an exemplary table showing relationships between words and categories.
FIG. 5 is an exemplary table illustrating values corresponding to the significance of the words from FIG. 4.
FIG. 6 is an exemplary table illustrating a representation of the words from FIG. 4 in a semantic space.
FIG. 7 is a block diagram of an exemplary computer system on which an exemplary advertisement placement system may be implemented.

Detailed Description of Exemplary Embodiments
The present disclosure is illustrated by way of example, and not by way of limitation, in the accompanying drawings, in which elements having the same reference numerals represent the same elements throughout.

In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent to one skilled in the art, however, that the concepts of the disclosure may be practiced or implemented without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present disclosure.

As used in this description, the term "data set" refers to a collection of expressions that is readable and/or understandable by humans and/or machines, and the term "keyword" refers to one or more elements of a data set, such as text, symbolic elements or numbers. For example, if the data set is a document, a keyword may be one or more words, phrases, punctuation marks, symbols and/or sentences contained in the document. A data set may be a collection of several different types of data sets, or a portion of a larger data set. A data set may also be a summary and/or tag that summarizes or describes the content of another data set. Keywords may or may not be directly visible to a user. For example, keywords may be part of the captions or hidden subtitles of a video file, the lyrics of an audio file, or elements of metadata associated with a Word document. Additional processing may be performed before a keyword can be identified or processed by a person or machine. For example, optical character recognition or speech recognition may be used to convert certain elements from a first format to a second format for easier processing and/or recognition by a person or machine.

Examples of data sets include web pages; video, audio or multimedia files; advertisements; emails; documents; RSS feeds; photographs; figures; drawings; electronic computer documents; sound recordings; broadcasts; metadata; and collections of one or more of the above.

Examples of keywords include words, phrases, symbols, terms, hyperlinks, metadata information and/or any displayed or non-displayed items that are included in or associated with a data set. In the context of this disclosure, a "web page" is understood to mean any combination or collection of information that can be displayed in a web browser such as Microsoft Internet Explorer. Such content may include, but is not limited to, HTML pages, JavaScript pages, XML pages, email messages and RSS news feeds.

As used in this disclosure, the term "target data set" refers to one or more data sets for which an exemplary system attempts to identify, from a group of data sets, one or more data sets that are contextually related to it. For example, a target data set may be a search query that a user enters with the intention of finding documents related to the query, or one or more web pages for which an exemplary system according to this disclosure attempts to find suitable advertisements to display together with the web pages.

For illustration purposes, the following examples describe the operation of embodiments that identify one or more data sets, such as advertisements, that are contextually related to a target data set, such as a web page being reviewed by a user. The identification is based on an analysis of unique semantic vectors, such as trainable semantic vectors (TSVs), that represent the web page and the advertisements, and of semantic representations that include information on representative keywords of the advertisements and the web page. Various formulas and statistical operations may be applied to identify important or representative keywords so that they can be given more weight than other keywords.

It is understood that similar approaches and methodologies may be applied to different types of data sets and/or target data sets. For example, a similar approach may be used to identify documents or web pages that are contextually related to one or more search queries (the target data set) entered by a user, or to identify web pages that may potentially relate to one or more advertisements.

A trainable semantic vector (TSV) is a unique type of semantic representation of a data set. It is generated based on the data points contained in the data set and on known relationships between known data points and predetermined categories. Details of the construction and properties of trainable semantic vectors are described in U.S. Patent No. 6,751,621, filed May 2, 2000 and entitled "CONSTRUCTION OF TRAINABLE SEMANTIC VECTORS AND CLUSTERING", and in U.S. Patent Application No. 11/126,184 (Attorney Docket No. 55653-019), filed May 11, 2005 and entitled "ADVERTISEMENT PLACEMENT METHOD AND SYSTEM USING SEMANTIC ANALYSIS", the disclosures of which are incorporated herein by reference in their entireties.
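The idea of a vector with one dimension per predetermined category can be sketched in a few lines. This is not the TSV construction of the referenced patents, which is not reproduced in this section; it is a simplified stand-in that sums per-keyword category weights, and the category names, weights and function name are all hypothetical.

```python
# Hypothetical predetermined categories; the vector has one entry per category.
CATEGORIES = ["sports", "finance", "travel"]

# Hypothetical known relationships between known keywords and categories.
KNOWN_RELATIONS = {
    "trail": {"sports": 0.6, "travel": 0.4},
    "loan": {"finance": 1.0},
    "boots": {"sports": 0.7, "travel": 0.3},
}

def build_semantic_vector(keywords, relations=KNOWN_RELATIONS,
                          categories=CATEGORIES):
    """Accumulate category weights of all known keywords in a data set."""
    index = {category: i for i, category in enumerate(categories)}
    vector = [0.0] * len(categories)
    for keyword in keywords:
        # Keywords with no known relationship contribute nothing.
        for category, weight in relations.get(keyword, {}).items():
            vector[index[category]] += weight
    return vector

vec = build_semantic_vector(["trail", "boots", "unknown"])
```

Because every data set is projected onto the same fixed set of categories, vectors for a web page and for an advertisement are directly comparable even when the two share no keywords.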

FIG. 1 is a diagram of an exemplary advertisement placement system 10 configured to identify, from a group of advertisements 12, one or more advertisements that are contextually related to a web page 11 being viewed by a user. The identification is based on an analysis of at least two types of semantic representations of the group of advertisements 12 and the web page 11: TSVs, and semantic representations that include information on representative keywords of the advertisements 12 and the web page 11. An advertisement 12 may consist of any combination of media, such as text, sound or animation. Based on the results of the analysis, the system 10 generates a matching result that identifies the selected advertisements that are contextually related to the web page 11.

The selection of one or more advertisements for a particular data set or web page can occur when the data set is presented to the user, or before or after it is presented. In another embodiment, the advertisement placement system 10 is used to select one or more advertisements 12 that are contextually related to the web page 11, such that the web page 11 is displayed with, or linked to, the one or more selected advertisements. Data sets identified as related to the target data set may be communicated or presented to the user together with the target data set, or at a time different from the presentation or communication of the target data set. A data set may be communicated or presented to the user in various forms or formats, such as an auditory, video, visual, tactile, or machine-readable form, or any combination thereof.

The TSV associated with each advertisement 12 or web page 11 may be precomputed or computed on the fly. In one embodiment, each web page or advertisement carries embedded or associated information about its precomputed TSV. In another embodiment, the TSV associated with the web page 11 is computed dynamically by the system 10.

FIG. 2 is a detailed block diagram of an embodiment of the advertisement placement system 10. As shown in FIG. 2, the advertisement placement system 10 includes term extractors 102, 112 for identifying and extracting keywords from the advertisements 12 or the web page 11. The term extractors 102, 112 perform linguistic analysis on the content of an advertisement 12 or the web page 11 and break its sentences into smaller units such as words and phrases. Frequently used terms, such as the grammatical words "the" and "a," may be removed using a preset stoplist. If the advertisement 12 or web page 11 contains information other than actual content (for example, HTML markup tags or Java scripting), that information may be removed. Software for performing term extraction is widely available and known to those skilled in the art.
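
By way of illustration only, the term extraction described above might be sketched in Python as follows. The function name `extract_terms`, the tiny `STOPLIST`, and the regular expressions are assumptions made for this sketch and are not taken from the disclosure; a production extractor would more likely use a proper HTML parser and fuller linguistic analysis.

```python
import re

# Hypothetical stoplist; a real one would contain many more grammatical words.
STOPLIST = {"the", "a", "an", "of", "and", "to", "in"}


def extract_terms(html: str) -> list[str]:
    """Strip markup and scripts, split into lowercase words, drop stoplist words."""
    # Remove whole <script> blocks first, then any remaining tags.
    text = re.sub(r"<script\b.*?</script>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOPLIST]
```

The sketch splits only into single words; phrase detection, which the description also mentions, is left out for brevity.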

The advertisement placement system 10 further includes TSV generators 103, 113 for computing TSVs for the web page 11 or the advertisements 12 based on the output of the term extractors 102, 112. The system 10 may use a single TSV generator for both the advertisements 12 and the web page 11. Alternatively, separate TSV generators may be used to process the output for the web page 11 and for the advertisements 12, respectively.

The advertisement placement system 10 includes a TSV indexer 114 and a TSV index database 118, which are used to organize and store the generated TSVs for efficient retrieval. The TSV indexer 114 may be implemented using a full database management system (DBMS) or simply a software package for managing large data records. The TSV index database 118 may be implemented as a database that stores TSV index files containing the TSVs of the advertisements 12 together with links to those advertisements. Different indexing schemes may be applied to speed up retrieval. For example, one common indexing scheme lists TSVs under the individual semantic categories that they reference.

The TSV associated with each advertisement 12 and the TSV associated with the web page 11 are input to a TSV matcher 104, which determines the TSV similarity between the web page 11 and each advertisement. The similarity may take the form of a relevance score. In one embodiment, the similarity or relevance between TSVs is determined based on the distance between the semantic vectors, for example the N-dimensional Euclidean distance between them, where N is the number of dimensions of the semantic space or of the predetermined categories. The shorter the distance between the TSV of the web page 11 and the TSV of an advertisement, the more similar the web page 11 and the advertisement are. Other comparison methods, such as the cosine measure, Hamming distance, Minkowski distance, or Mahalanobis distance, may also be used. Various optimizations may be applied to improve comparison time, including reducing the dimensionality of the TSVs before the comparison or applying filters that exclude certain advertisements before or after the comparison.
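
As a minimal sketch of the comparison just described, the following Python code computes the Euclidean distance and the cosine measure between two vectors and orders advertisements from closest to farthest. Representing a TSV as a plain list of numbers, and the names `euclidean_distance`, `cosine_similarity`, and `rank_by_distance`, are assumptions made for illustration.

```python
import math


def euclidean_distance(u, v):
    """N-dimensional Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine measure; larger means more similar. Zero vectors score 0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_distance(page_tsv, ad_tsvs):
    """Return advertisement ids ordered from closest to farthest TSV."""
    return sorted(ad_tsvs, key=lambda ad_id: euclidean_distance(page_tsv, ad_tsvs[ad_id]))
```

A distance can be turned into a relevance score in several ways (for example, by negating or inverting it); the disclosure does not fix one, so the sketch ranks on the raw distance.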

Based on the TSV comparison results, the TSV matcher 104 generates a TSV match list 105, which contains a ranked list of matching advertisements selected from the advertisements 12 according to their respective TSV similarities to the web page 11. A preset threshold may be applied so that only advertisements whose similarity exceeds the threshold are selected.

The advertisement placement system 10 further includes a mechanism for determining and comparing contextual representations of the web page 11 and the advertisements 12 that are of a different type than TSVs. In one embodiment, the advertisement placement system 10 generates semantic representations that include information on representative keywords of the web page 11 and the advertisements 12.

As shown in FIG. 2, keyword selectors 115, 106 receive the terms extracted by the term extractors 102, 112 and select a subset of keywords from the content of the web page 11 or each advertisement 12 to represent it. The selection follows one or more metrics, such as term frequency (how often a term occurs on a page), inverse document frequency (based on the fraction of pages in a collection that contain the term), or other approaches well known to those skilled in the art. For example, the keyword selectors 115, 106 may compute the frequency or number of occurrences of each term in the web page 11 or in each advertisement and select representative keywords based on the computed values.
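
One way to combine the two metrics named above is the familiar TF-IDF weighting. The sketch below is an assumption-laden illustration, not the disclosed selector: the function name `select_keywords`, the smoothed IDF formula, and the `top_k` cut-off are all choices made here for the example.

```python
import math
from collections import Counter


def select_keywords(terms, collection, top_k=3):
    """Pick the top_k terms of one page by a TF-IDF style score.

    terms:      list of extracted terms for the page (with repeats)
    collection: list of sets, each holding the terms of one reference page
    """
    term_counts = Counter(terms)
    n_docs = len(collection)
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for doc in collection if term in doc)
        # Smoothed inverse document frequency (an assumption of this sketch).
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        scores[term] = count * idf
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
```

Terms that are frequent on the page but rare in the collection score highest, which matches the intent of choosing keywords that characterize the page.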

Another example uses a stoplist to remove keywords that provide little information about the subject of the web page 11 or advertisement 12. The term extractors 102, 112 hold, or have access to, a stoplist containing the most commonly occurring words, which convey little about the subject. Keywords on the stoplist are not good search terms. The stoplist may be created by language experts, by automatic (for example, statistical) analysis, by users, or by any combination of the three. It is understood that other approaches known to those skilled in the art may be used to select keywords from the web page 11 or advertisement 12 to represent it.

After the representative keywords of each advertisement have been identified by the keyword selector 115, they are stored in a keyword index database 117 and linked to the respective advertisement 12.

As illustrated in FIG. 2, a keyword matcher 107 is provided to determine the keyword similarity between the web page 11 and each advertisement 12, based on information about the selected keywords representing the web page 11 and each respective advertisement. In one embodiment, the keyword matcher 107 looks up the set of keywords selected for the web page 11 in the keyword index database 117 and generates a keyword relevance score for each advertisement with respect to the web page 11 according to one or more known algorithms. For example, the relevance score between two sets of representative keywords may be computed from the number of matching or common keywords contained in the advertisement and the web page (one term, one vote). In another embodiment, the keyword matcher 107 employs a more sophisticated voting scheme (electoral college, weighted shares, aristocracy with absolute veto, or loudness of support) to determine the similarity between each advertisement and the web page 11. Other types of computation, such as the vector space model, may use a straight or modified cosine similarity measure to compute the relevance score.
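
The "one term, one vote" scheme can be sketched in a few lines of Python. The function names are hypothetical, and the normalized variant (the Jaccard index) is an addition made here only to show one way of bringing the score into [0, 1]; the disclosure does not prescribe it.

```python
def vote_score(page_keywords, ad_keywords):
    """One term, one vote: count the keywords the two sets share."""
    return len(set(page_keywords) & set(ad_keywords))


def normalized_vote_score(page_keywords, ad_keywords):
    """Shared keywords divided by all distinct keywords (Jaccard index)."""
    page, ad = set(page_keywords), set(ad_keywords)
    union = page | ad
    return len(page & ad) / len(union) if union else 0.0
```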

After computing the similarity between the web page 11 and each respective advertisement, the keyword matcher 107 generates a keyword match list 108 that ranks the advertisements 12 based on their respective similarities to the web page 11 and their respective relevance scores.

The TSV match list 105 and the keyword match list 108 are sent to a combiner 109, which generates a final match list 110 according to the information contained in both lists. In one embodiment, for each advertisement in the TSV match list 105 or the keyword match list 108, the combiner 109 computes a composite relevance score based on the advertisement's relevance scores in the TSV match list 105 and the keyword match list 108. The final match list 110 is then generated according to the composite relevance score of each advertisement.

In one embodiment, the composite relevance score is computed as follows.
If the advertisement appears in both the TSV match list 105 and the keyword match list 108:
composite score = a_1 * TSV score + b_1 * keyword score + c_1 (1)
If the advertisement appears only in the TSV match list 105:
composite score = a_2 * TSV score + c_2 (2)
If the advertisement appears only in the keyword match list 108:
composite score = b_3 * keyword score + c_3 (3)
The coefficients a_1, a_2, b_1, b_3, c_1, c_2, and c_3 may be chosen such that Equations 2 and 3 are special cases of Equation 1. The relevance scores in any or all of the match lists may be normalized to [0, 1]. Conditional or unconditional thresholds may be applied to the relevance scores in any or all of the match lists to shorten the lists. The final match list 110 is compiled according to the composite scores of the advertisements.
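
The three cases can be illustrated with a short Python sketch. It assumes the special-case choice mentioned in the text, namely a single set of coefficients with a missing score treated as zero, so that Equations 2 and 3 reduce to Equation 1. The coefficient values (0.6, 0.4, 0.0) and the names `composite_score` and `final_match_list` are hypothetical and chosen only for the example.

```python
def composite_score(tsv_score=None, keyword_score=None, a=0.6, b=0.4, c=0.0):
    """Equation 1, with a score of None (ad absent from that list) treated as 0."""
    if tsv_score is None and keyword_score is None:
        raise ValueError("advertisement appears in neither match list")
    tsv = 0.0 if tsv_score is None else tsv_score
    keyword = 0.0 if keyword_score is None else keyword_score
    return a * tsv + b * keyword + c


def final_match_list(tsv_list, keyword_list):
    """Merge two {ad_id: score} match lists and rank by composite score."""
    ad_ids = set(tsv_list) | set(keyword_list)
    scored = {
        ad: composite_score(tsv_list.get(ad), keyword_list.get(ad)) for ad in ad_ids
    }
    # Highest composite score first; ties broken by ad id for determinism.
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))
```

With separate coefficients per case (a_2, c_2, b_3, c_3), the same structure applies; only the branch taken for single-list advertisements changes.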

In another embodiment, the advertisements in the TSV match list 105 and the keyword match list 108 are reordered using a unique formula to form an exemplary final match list 110. Each advertisement in the two lists is associated with a TSV relevance score and a keyword relevance score. The TSV match list 105 ranks the advertisements according to their TSV relevance scores, and the keyword match list 108 ranks them according to their keyword relevance scores. One of the two scores is designated as the primary relevance score and the other as the secondary relevance score.

Table 1 shows an exemplary ranked list with the TSV relevance score as the primary relevance score and the keyword relevance score as the secondary relevance score. For illustrative purposes, the relevance scores are normalized to [0, 1].

The primary relevance score of each advertisement is mapped to a preset relevance level that corresponds to a specific range of relevance scores. The advertisements are then ranked according to their mapped relevance levels. The secondary relevance score of each advertisement is used to rank the advertisements within each relevance level.

For example, in the example shown in Table 1, the TSV relevance scores are mapped to three different relevance levels:

If relevance score < 0.4, then relevance level = 1.

If 0.4 <= relevance score < 0.7, then relevance level = 2.

If relevance score >= 0.7, then relevance level = 3.

After the conversion, the advertisements are re-ranked according to their respective relevance levels. The advertisements within each relevance level are then re-ranked according to their respective secondary relevance scores. The re-ranking results are shown in Table 2. Column 1 of Table 2 is the final relevance ranking of the advertisements.
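
The two-stage ranking can be sketched as follows, using the three score ranges given above. The function names and the tuple layout are assumptions of this sketch.

```python
def relevance_level(primary_score):
    """Map a primary relevance score in [0, 1] to one of three levels."""
    if primary_score < 0.4:
        return 1
    if primary_score < 0.7:
        return 2
    return 3


def rerank(ads):
    """Order ads by relevance level, then by secondary score within a level.

    ads: list of (ad_id, primary_score, secondary_score) tuples
    """
    return sorted(ads, key=lambda ad: (-relevance_level(ad[1]), -ad[2]))
```

Note the effect of the level mapping: an advertisement with a slightly lower primary score can outrank another in the same level if its secondary score is higher, while no advertisement can move past one in a higher level.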

The advertisement placement system 10 then selects one or more advertisements from the final match list 110, according to the ranking of the final match list 110, to be related to the web page 11. In one embodiment, the selected advertisements are displayed with, or linked to, the web page 11.

It is understood that, in other embodiments, the keyword relevance score may be designated as the primary relevance score and the TSV relevance score as the secondary relevance score. It is also understood that, depending on design preference, a different number of range levels may be used to convert the relevance scores. It is further understood that conditional or unconditional thresholds may be applied to the relevance scores in any or all of the match lists to shorten the lists.

In another embodiment, the system 10 may generate the final match list 110 by relying primarily on only one of the TSV match list 105 and the keyword match list 108. For example, the system 10 relies on the keyword match list 108 to select a preset number of advertisements according to their respective keyword relevance scores. A TSV relevance score is also computed for each advertisement. The advertisements on the keyword match list 108 are then re-ranked based on their respective TSV relevance scores. The system 10 outputs the re-ranked match list as the final match list 110.
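
A minimal sketch of this shortlist-then-re-rank variant is given below. The function name `keyword_then_tsv`, the dictionary inputs, and the treatment of a missing TSV score as 0 are assumptions made for illustration.

```python
def keyword_then_tsv(keyword_scores, tsv_scores, n):
    """Shortlist the n best ads by keyword score, then re-rank them by TSV score.

    keyword_scores, tsv_scores: {ad_id: relevance score}
    """
    shortlist = sorted(keyword_scores, key=lambda ad: (-keyword_scores[ad], ad))[:n]
    # Ads without a TSV score are treated as scoring 0 (an assumption).
    return sorted(shortlist, key=lambda ad: (-tsv_scores.get(ad, 0.0), ad))
```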

FIG. 3 shows another exemplary advertisement placement system 20 for relating one or more advertisements 12 to a web page 11 based on contextual relevance. For brevity, elements bearing the same reference numerals denote similar elements already discussed.

In the system 20, the TSVs and keyword semantic representations for the advertisements 12 are stored in a database 212. For each advertisement, the database 212 provides two data fields: one for the TSV and one for the keyword semantic representation. The advertisement placement system 20 further includes a TSV and keyword indexer 211 for organizing and managing the TSVs and keyword semantic representations. The TSV and keyword indexer 211 may be implemented using a full database management system (DBMS) or simply a software package for managing large data records, and the database 212 may be implemented as a database. Different indexing schemes may be applied to speed up retrieval.

The system 20 includes term extractors 102 and 112, TSV generators 103 and 113, and keyword selectors 106 and 115, all with the same functionality as already described with respect to FIG. 2. For each advertisement, a TSV and keyword combiner 210 appropriately associates the TSV and keyword semantic representation with the advertisement. Similarly, for the web page 11, the TSV is generated by the TSV generator 103 and the keyword semantic representation is generated by the keyword selector 106. A TSV and keyword combiner 205 associates or links that TSV and keyword semantic representation with the web page 11. The information on the TSVs and keyword semantic representations for the web page 11 and the advertisements 12 is processed by a TSV and keyword matcher 206, which performs functions similar to those of the TSV matcher 104 and the keyword matcher 107 already discussed with respect to FIG. 2. The relevance scores for the TSVs and keyword semantic representations may be computed in the same manner as described with respect to FIG. 2. A final match list 213 is generated by the TSV and keyword matcher 206 as already discussed with respect to FIG. 2.

In another embodiment, a combined relevance score for each advertisement, or for each candidate or target data set, may be computed by combining the keyword semantic representation and the semantic vector representation of the data set in the same vector space. For example, both the keyword representation and the semantic vector representation of an advertisement are treated as vectors in the same vector space and combined to form a single combined semantic vector representation of the advertisement.

In computing the combined semantic vector representation, different weights may be assigned to the semantic vector representation and the keyword semantic representation. For each advertisement, a relevance score is computed based on the combined semantic vector representation of the advertisement and the combined semantic vector representation of the target data set. The final match list 213 is generated by the TSV and keyword matcher 206 according to the respective combined relevance scores of the advertisements.
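
The disclosure does not spell out how the two representations are combined. One possible reading, sketched below purely as an assumption, is to scale each representation by its weight and concatenate the results into one longer vector, which can then be compared with the cosine measure. The weights 0.7 and 0.3 and the names `combine` and `cosine` are hypothetical.

```python
import math


def combine(tsv, keyword_vector, w_tsv=0.7, w_keyword=0.3):
    """Weight each representation and concatenate them into one vector."""
    return [w_tsv * x for x in tsv] + [w_keyword * x for x in keyword_vector]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

For the comparison to be meaningful, the keyword vectors of the advertisement and the target data set must share one vocabulary ordering, just as their TSVs share one set of categories.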

It is understood that a match list generated based on keyword or TSV comparison may be further refined or re-ranked by other known methods. For example, the data sets or web pages in a ranked list may be reordered using an algorithm that takes link information between web pages into account in the final ranking, such as the PageRank algorithm developed by Google, Inc. and described in U.S. Patent No. 6,285,999, entitled "METHOD FOR NODE RANKING IN A LINKED DATABASE," the entire disclosure of which is incorporated herein by reference.

Construction of TSVs
The construction of a TSV for a data set will now be described. Further details of TSVs are described in U.S. Patent No. 6,751,621 and U.S. Patent Application No. 11/126,184, the disclosures of which have already been incorporated by reference.

In preparation for generating a TSV for a data set, a semantic dictionary is used to look up the TSVs corresponding to the data points contained in the data set. The semantic dictionary contains known relationships between a plurality of known data points and a plurality of predetermined categories. In other words, the semantic dictionary contains "definitions," that is, the TSVs of the corresponding words or phrases.

An exemplary process for generating a TSV for a data set using a TSV generator will now be described. The data set may be an advertisement, a web page, or any other type of data set. For illustrative purposes, "words" are used as examples of the keywords contained in a document. It is understood that many other types of data points or keywords may be contained in a document, such as words, phrases, symbols, terms, hyperlinks, metadata information, graphics, and/or any displayed or non-displayed items, or any combination thereof.

Based on the input keywords of a document, the TSV generator identifies the corresponding keywords in the semantic dictionary and retrieves the respective TSV of each keyword contained in the document, based on the definitions given by the semantic dictionary. The TSV generator 103 generates the TSV of the document by combining the respective TSVs of the keywords contained in the document. For example, the TSV of a document may be defined as the vector sum of the respective TSVs of all keywords contained in the document.
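
The vector-sum definition can be illustrated with a short Python sketch. The semantic dictionary is modeled here as a plain mapping from keyword to a list of per-category values; that representation, the name `document_tsv`, and the choice to skip keywords missing from the dictionary are assumptions of the sketch.

```python
def document_tsv(keywords, dictionary, n_dims):
    """Sum the TSVs of all keywords in a document, dimension by dimension.

    keywords:   list of keywords in the document (repeats count repeatedly)
    dictionary: {keyword: list of n_dims numbers}, the semantic dictionary
    """
    total = [0.0] * n_dims
    for word in keywords:
        vector = dictionary.get(word)
        if vector is None:
            continue  # Assumption: unknown keywords contribute nothing.
        for i, value in enumerate(vector):
            total[i] += value
    return total
```

For example, with a two-category dictionary in which "bike" maps to [1.0, 0.0] and "loan" maps to [0.0, 1.0], a document containing "bike" twice and "loan" once receives the TSV [2.0, 1.0].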

The process for creating the semantic dictionary will now be described. In one embodiment, the semantic dictionary is generated by appropriately determining which predetermined categories each of a plurality of known data sets falls into. A sample data set may fall into one or more predetermined categories, or may be restricted to association with a single category. For example, a news report on a patent infringement lawsuit involving a computer company may fall into categories including "Intellectual Property Law," "Business Disputes," "Operating Systems," and "Economic Issues," depending on the content of the report and on the predetermined categories. Once a sample data set is determined to be related to a given predetermined category, all keywords contained in the sample data set are associated with that same predetermined category. The same process is performed for all sample data sets.
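
The association step can be sketched as a simple co-occurrence count between keywords and categories. This is a deliberate simplification: the actual TSV weighting is described in the incorporated U.S. Patent No. 6,751,621 and is not reproduced here. The name `build_cooccurrence` and the input layout are assumptions.

```python
from collections import defaultdict


def build_cooccurrence(samples, categories):
    """Count, for each keyword, how many sample data sets of each category contain it.

    samples:    list of (keywords, assigned_categories) pairs
    categories: ordered list of all predetermined categories
    """
    index = {category: i for i, category in enumerate(categories)}
    table = defaultdict(lambda: [0] * len(categories))
    for keywords, assigned in samples:
        for word in set(keywords):  # Count each keyword once per sample.
            for category in assigned:
                table[word][index[category]] += 1
    return dict(table)
```

Each row of the resulting table has one entry per predetermined category, which is the shape a keyword's entry in the semantic dictionary would take before any weighting or normalization.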

In one embodiment, the relationships between sample documents and categories can be determined by analyzing the Open Directory Project (ODP), in which expert editors have assigned hundreds of thousands of web pages to a rich topic hierarchy. These sample web pages with assigned categories are referred to as training documents for determining the relationships between keywords and predetermined categories. It should be apparent to those skilled in the art that other online topic hierarchies, classification schemes, and ontologies can likewise be used to relate sample training documents to categories.

The following steps describe how the ODP hierarchy is transformed for the purpose of generating the TSV semantic dictionary.

1. Download the ODP web pages. The association between each web page and the ODP category to which it belongs is retained. Remove any web pages that did not download properly, and translate URLs into internal path names.

2. Optionally, download all web pages referenced by any of the ODP web pages above, and create an association between each new web page and the ODP category to which the original ODP web page belongs. Optionally, filter the new web pages to keep only those that have the same category as the original ODP web page from which they were derived. Remove any web pages that did not download properly, and translate URLs into internal path names.

3. Optionally, remove unwanted categories. Certain types of ODP categories are removed before processing. These removed categories may include empty categories (categories with no corresponding documents), letter-bar categories (such as "movie titles starting with A, B, ...", which carry no useful semantic distinction), and other categories that contain no information useful for identifying semantic content (for example, empty categories or unwanted foreign-language regional pages) or that contain misleading or inappropriate information (for example, adult-content pages).

4. Remove pages that are not suitable for training. In one embodiment, only pages having at least a minimum amount of content are used for training. In another embodiment, a training page must have at least 1000 bytes of converted text and at most 5000 whitespace-delimited words.

5. Optionally, remove any pages that are not written in English. This can be done by standard methods such as HTML meta tags, automatic language detection, filtering on URL domain names, filtering on character ranges, or other techniques well known to those skilled in the art.

6. Optionally, remove duplicates. If a page appears in more than one ODP category, it is ambiguously classified and may not be a good candidate for training.

7. Identify candidate TSV dimensions. Run the collapse-trim algorithm, described below, to automatically flatten the ODP hierarchy and identify candidate TSV dimensions.

8. Optionally, adjust the TSV dimensions. Examine the automatically generated TSV dimensions and manually collapse, split, or remove particular dimensions based on their expected semantic properties. The types of adjustment may include, but are not limited to, the following. First, if certain words occur frequently in the original category names, those categories may be collapsed into their parent nodes (because they either all discuss the same thing or are semantically meaningless). Second, certain specific categories may be collapsed into their parents (usually because they are too specific). Third, certain groups of categories that are separated in the ODP hierarchy may be merged (for example, "Arts/Magazines and E-Zines/E-Zines" may be merged with "Arts/Online Writing/E-zines").

9. Create the TSV training file. For each potential training page, associate the page with the TSV dimension into which the page's category was collapsed. Then, for each TSV dimension, select the pages that will be used to train that dimension, taking care not to overtrain or undersample. In one embodiment, we randomly select 300 pages that have at least 1000 bytes of converted text (if fewer than 300 such pages exist, all of them are selected). We then remove any page longer than 5000 whitespace-delimited words and keep at most 200,000 whitespace-delimited words for the whole dimension, starting with the smallest page and stopping when the cumulative word count reaches 200,000.

10. Optionally, relabel the dimensions. Each dimension starts with a label identical to the ontology path of the ODP category from which it was derived. In one embodiment, some labels are manually adjusted to make them shorter and more readable, and to ensure that they reflect the various subcategories that were combined or removed. For example, the original label "Top/Shopping/Vehicles/Motorcycles/Parts_and_Accessories/Harley_Davidson" might be rewritten as "Harley Davidson, Parts and Accessories".
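The page selection in step 9 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name, the parameter names, and the fixed random seed are assumptions, and the word budget is read as "stop adding pages once the running total has reached the limit", which is only one possible interpretation of the text.

```python
import random


def select_training_pages(pages, max_pages=300, min_bytes=1000,
                          max_words_per_page=5000, max_words_total=200_000,
                          seed=0):
    """Pick training pages for one TSV dimension.

    `pages` is a list of strings holding the converted text of each page.
    Default values are the figures quoted in step 9; all names are hypothetical.
    """
    rng = random.Random(seed)
    # Keep only pages with at least `min_bytes` of converted text.
    eligible = [p for p in pages if len(p.encode("utf-8")) >= min_bytes]
    # Randomly choose up to `max_pages` of them (all, if fewer are available).
    if len(eligible) > max_pages:
        eligible = rng.sample(eligible, max_pages)
    # Drop pages that are too long, measured in whitespace-delimited words.
    sized = [(len(p.split()), p) for p in eligible]
    sized = [(n, p) for n, p in sized if n <= max_words_per_page]
    # Starting with the smallest page, keep pages until the budget is reached.
    sized.sort(key=lambda item: item[0])
    kept, total = [], 0
    for n, p in sized:
        if total >= max_words_total:
            break
        kept.append(p)
        total += n
    return kept
```

Sorting by size before applying the budget favors many short pages over a few long ones, which is one plausible way to guard against a dimension being dominated by a handful of documents.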

In one embodiment, the collapse-trim algorithm walks the ODP hierarchy from the bottom up, looking at the number of pages directly available at each category node. If at least 100 pages are stored at a node, that node is kept as a TSV dimension. Otherwise, the node is collapsed into its parent node.
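A minimal sketch of such a bottom-up pass is shown below. The tree representation (`children` and `pages` dictionaries) is hypothetical, and two details are assumptions not stated in the text: pages of a collapsed node are merged into its parent's pool, and pages that reach the root without meeting the threshold are discarded (the "trim").

```python
def collapse_trim(children, pages, root, min_pages=100):
    """Flatten a category tree into candidate TSV dimensions.

    children: dict mapping a category to a list of its child categories
    pages:    dict mapping a category to the pages stored directly in it
    Returns {kept category: pages}; pages of collapsed descendants are
    merged into the nearest kept ancestor (an assumption of this sketch).
    """
    dimensions = {}

    def visit(node):
        # Visit children first so that the walk is bottom-up.
        pool = list(pages.get(node, []))
        for child in children.get(node, []):
            pool.extend(visit(child))
        if len(pool) >= min_pages:
            dimensions[node] = pool
            return []      # kept as a dimension: nothing moves up
        return pool        # collapsed: its pages move up to the parent

    visit(root)            # leftovers returned at the root are dropped
    return dimensions
```

For example, a node with 70 pages of its own and a collapsed child with 40 pages would be kept as a dimension holding 110 pages, while a sibling with 120 pages would be kept separately.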

After the sample data sets have been assigned to the predetermined categories (dimensions), a data table is created from the assignment results. The table stores information representing the relationship between the keywords contained in one or more sample data sets and the predetermined categories. Each entry in the data table establishes a relationship between a keyword and one of the predetermined categories. For example, an entry can correspond to the number of sample data sets within a category that contain a particular keyword. The keywords correspond to the contents of the sample data sets, while the predetermined categories correspond to the dimensions of the semantic space. The data table may be used to generate a semantic dictionary containing a "definition" of each word, phrase, or other keyword within the particular semantic space formed by the predetermined categories; this dictionary is used to construct trainable semantic vectors.

FIG. 4 shows an exemplary data table for building a semantic dictionary. For simplicity and ease of understanding, the number of words and the number of predetermined categories in FIG. 4 are each reduced to five. In practice, there may be hundreds of thousands of terms and predetermined categories.

As illustrated in FIG. 4, table 200 includes rows 410 corresponding to the predetermined categories Cat_1, Cat_2, Cat_3, Cat_4, and Cat_5, and columns 412 corresponding to the representative words W_1, W_2, W_3, W_4, and W_5. Each entry 414 in table 200 corresponds to the number of documents in which a particular word, such as one of the words W_1, W_2, W_3, W_4, and W_5, occurs in the corresponding category.

Summing the columns 412 across each row 410 gives the total number of documents that contain the word represented by that row 410. These values are shown in column 416. Referring to FIG. 4, word W_1 appears 20 times in category Cat_2 and 8 times in category Cat_5. Word W_1 does not appear in categories Cat_1, Cat_3, or Cat_4.

Referring to column 416, word W_1 appears a total of 28 times across all categories. In other words, 28 of the classified documents contain W_1. Examining an exemplary column 412, such as Cat_1, word W_2 appears once in category Cat_1, word W_3 appears eight times, and word W_5 appears twice. Word W_4 does not appear in category Cat_1 at all. As already noted, word W_1 does not appear in category Cat_1. Referring to row 418, the entry corresponding to category Cat_1 indicates that 11 documents are classified under category Cat_1.
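A table of this kind can be built with a few lines of code. The sketch below is illustrative only: the input format (a list of category and text pairs), the lowercase whitespace tokenization, and the function name are assumptions, and the sample data are invented rather than taken from FIG. 4.

```python
def build_count_table(samples):
    """Count, per keyword and category, the data sets containing the keyword.

    samples: iterable of (category, text) pairs, one per sample data set.
    Returns a nested dict: counts[word][category] -> number of data sets.
    """
    counts = {}
    for category, text in samples:
        # A word is counted once per data set, however often it occurs in it.
        for word in set(text.lower().split()):
            row = counts.setdefault(word, {})
            row[category] = row.get(category, 0) + 1
    return counts


# Invented sample data, for illustration only.
samples = [
    ("Cat_1", "river bank"),
    ("Cat_1", "bank loan bank"),
    ("Cat_2", "loan rate"),
]
counts = build_count_table(samples)
# Row total for a word, analogous to column 416 of the exemplary table.
loan_total = sum(counts["loan"].values())
```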

According to one embodiment, after the data table has been constructed, the significance of each entry in the data table is determined. Under certain circumstances, the significance of an entry can be regarded as the relative strength with which a word occurs in a particular category, or its relevance to that category. This relationship should not be regarded as limiting, however. The significance of each entry is restricted only to the actual data sets and categories (that is, to the features considered important for representing and describing the categories). According to one embodiment of the present disclosure, the significance of each word is determined from the statistical behavior of the word across all categories. This can be achieved by first calculating the proportion of keyword occurrences within each category according to the following formula:
$$\mu = \mathrm{Prob}(\text{entry} \mid \text{category}) = \frac{(\text{entry}_n, \text{category}_m)}{\text{category}_{m\_total}}$$
Next, the probability distribution of a keyword's occurrence across all categories is calculated according to the following formula:
$$\nu = \mathrm{Prob}(\text{category} \mid \text{entry}) = \frac{(\text{entry}_n, \text{category}_m)}{\text{entry}_{n\_total}}$$
Here $(\text{entry}_n, \text{category}_m)$ denotes the table value for entry $n$ in category $m$, $\text{category}_{m\_total}$ is the total for category $m$, and $\text{entry}_{n\_total}$ is the total for entry $n$ across all categories. Both μ and ν represent the strength with which a word is associated with a particular category. For example, if a word occurs in only a small number of data sets from one category and appears in no other category, it will have a high ν value and a low μ value for that category. If an entry appears in most data sets from a category but also appears in several other categories, it will have a high μ value and a low ν value for that category.
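The two ratios can be computed directly from a count table. The following sketch assumes a dense table indexed as `counts[word][category]` and takes each category total as the sum of its column, which matches the Cat_1 example (1 + 8 + 2 = 11) but is otherwise an assumption. The numbers in the usage lines are invented.

```python
def mu_nu(counts):
    """Compute mu and nu for every (word, category) cell.

    counts[w][c]: number of data sets in category c that contain word w.
    Returns (mu, nu) as nested lists with the same shape as `counts`.
    """
    n_words, n_cats = len(counts), len(counts[0])
    word_totals = [sum(row) for row in counts]
    # Assumption: the category total is the sum of the column.
    cat_totals = [sum(counts[w][c] for w in range(n_words))
                  for c in range(n_cats)]
    mu = [[counts[w][c] / cat_totals[c] if cat_totals[c] else 0.0
           for c in range(n_cats)] for w in range(n_words)]
    nu = [[counts[w][c] / word_totals[w] if word_totals[w] else 0.0
           for c in range(n_cats)] for w in range(n_words)]
    return mu, nu


# Invented two-word, two-category table.
example = [[2, 0],    # rare word, seen only in the first category
           [8, 10]]   # common word, spread over both categories
mu, nu = mu_nu(example)
```

In this toy table the rare word has ν = 1.0 but μ = 0.2 for the first category, which is the "high ν, low μ" case described above.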

Depending on the quality and type of the information being represented, additional data manipulation can be performed to enhance the determined significance of each word. For example, the μ values for each category can be normalized (that is, divided) by the sum of all values for a given keyword, so that they can be interpreted as a probability distribution.
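As a small illustration of that normalization, the helper below divides one keyword's vector of μ values by its sum. The function name is hypothetical, and leaving an all-zero vector unchanged is a choice made for this sketch, not something the text specifies.

```python
def normalize(vector):
    """Scale a keyword's values so that they sum to 1."""
    total = sum(vector)
    if total == 0:
        return list(vector)   # nothing to scale (choice made for this sketch)
    return [v / total for v in vector]
```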

A weighted average of μ and ν can also be used to determine the significance of a keyword, according to the following formula:
$$\alpha\,\nu + (1 - \alpha)\,\mu$$
The variable α is a weighting factor that can be determined from the information being represented and analyzed. According to one embodiment of the present disclosure, the weighting factor has a value of about 0.75. Other values can be selected depending on various factors, such as the type and quality of the information or the level of detail needed to represent it. Through empirical evidence gathered from experiments, the inventors have determined that a weighted average of the μ and ν vectors can yield results superior to those achievable using μ alone, ν alone, or an unweighted combination of μ and ν.
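Applied to one word, the weighted average turns its μ and ν vectors into a single vector with one value per category. The sketch below uses the quoted value of about 0.75 as the default; the function name and the input numbers are illustrative.

```python
def combine(mu_row, nu_row, alpha=0.75):
    """Weighted average alpha * nu + (1 - alpha) * mu, one value per category."""
    return [alpha * n + (1 - alpha) * m for m, n in zip(mu_row, nu_row)]


# Invented values: mu = 0.2 and nu = 1.0 in the first category.
weights = combine([0.2, 0.0], [1.0, 0.0])
```

With α = 0.75 the first entry is 0.75 × 1.0 + 0.25 × 0.2, or approximately 0.8, so ν dominates the result while μ still contributes.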

FIG. 5 illustrates the manipulations described above, based on the data from FIG. 4. In FIG. 5, table 230 stores values indicating the relative strength of each word with respect to the categories. Specifically, the proportion of keyword occurrences within each category (that is, μ) is presented in the form of a vector for each word. The value of each entry in the μ vector is calculated according to the following formula:
$$\mu = \mathrm{Prob}(\text{word} \mid \text{category}) = \frac{(\text{word}_n, \text{category}_m)}{\text{category}_{m\_total}}$$
Table 230 also presents the probability distribution of a keyword's occurrence across all categories (that is, ν) in the form of a vector for each word. The value of each entry in the ν vector is calculated according to the following formula:
$$\nu = \mathrm{Prob}(\text{category} \mid \text{word}) = \frac{(\text{word}_n, \text{category}_m)}{\text{word}_{n\_total}}$$
Referring to FIG. 6, table 250 is shown to illustrate the semantic representations, or "definitions", of the words from FIG. 4. Table 250 is a combination of five TSVs, each corresponding to the semantic representation of one word across the semantic space. For example, the first row corresponds to the TSV of word W_1. Each TSV has dimensions corresponding to the predetermined categories. In addition, the TSVs for words W_1, W_2, W_3, W_4, and W_5 are calculated according to the disclosed embodiment, in which the entries are scaled to optimize the significance of the word with respect to each particular category. More specifically, the values are calculated using the following formula:
$$\alpha\,\nu + (1 - \alpha)\,\mu$$
The entries of each TSV are calculated from the actual values stored in table 230. Accordingly, the TSVs shown in table 250 correspond to the "definitions" of the exemplary words W_1, W_2, W_3, W_4, and W_5 represented in FIG. 4 with respect to each predetermined category, or vector dimension, and collectively constitute a semantic dictionary for the semantic space formed by the predetermined categories.

It is sometimes desirable to place advertisements in documents that are local to the market for the advertised product. This can be accomplished by incorporating geographic information (zip code, city/state name, and so on) in the advertisement, or by accessing the user's IP address and associating it with a geographic region. However, not every document contains geographic information in a suitable form, and not every user has an IP address that corresponds to the user's local area. In such cases, additional categories related to geographic regions can be included among the predetermined categories when the semantic dictionary is formed as described above. Each geographic region becomes a dimension in the semantic space, and sample data sets tagged with geographic information are used to create the semantic dictionary. That semantic dictionary is then used to generate TSVs for data sets and advertisements that reflect the strength with which they are associated with different geographic regions.

The application of TSVs is not limited to a single language. Semantic dictionaries can be built for various languages, as long as suitable sample data sets are available. For example, the English sample data sets from the Open Directory Project can be replaced with suitable sample data sets in another language when generating a semantic dictionary. A separate semantic dictionary can exist for each language. Alternatively, the keywords for all languages can reside in a single common semantic dictionary. Different languages may share the same predetermined categories or semantic dimensions, or may have entirely different ones, depending on whether they share the same semantic dictionary and whether it is desired to compare semantic vectors across languages.

After the semantic dictionary has been created, it can be accessed by the TSV generator 103 to find the TSVs corresponding to the keywords contained in a target document. In one embodiment, the TSVs of the keywords contained in the target document are combined to generate the TSV of the target document. The manner in which the TSVs are combined depends on the specific implementation. For example, the TSVs can be combined using vector addition. In that case, the TSV for a document can be expressed as:
$$\mathrm{TSV}(\text{document}) = \mathrm{TSV}(W_1) + \mathrm{TSV}(W_2) + \mathrm{TSV}(W_3) + \cdots + \mathrm{TSV}(W_N)$$
where $W_1, W_2, W_3, \ldots, W_N$ are the words contained in the document.
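The vector-addition case can be sketched as below. The dictionary layout (a mapping from word to a list of per-category values) and the decision to skip words that are missing from the dictionary are assumptions made for this illustration; the values are invented.

```python
def document_tsv(words, dictionary, n_dims):
    """Sum the TSVs of a document's words, dimension by dimension.

    dictionary: {word: list of n_dims values}, a hypothetical layout.
    """
    total = [0.0] * n_dims
    for word in words:
        vec = dictionary.get(word)
        if vec is None:
            continue   # word not in the dictionary: skipped (assumption)
        total = [t + v for t, v in zip(total, vec)]
    return total


# Invented two-dimensional dictionary.
dictionary = {"bank": [0.5, 0.25], "loan": [0.25, 0.0]}
doc_vector = document_tsv(["bank", "loan", "bank", "xyz"], dictionary, 2)
```

Because the sum runs over word occurrences, a word that appears twice contributes its TSV twice; an implementation could equally sum over distinct words, which the text leaves open.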

Generation of the TSV for a data set may draw on many types of information, including the keywords in the data set, information retrieved based on the advertisement and the keywords contained in the data set, and additional information assigned to the data set. For example, the TSV for an advertisement may be generated from information including, but not limited to, the words displayed in the advertisement, a set of keywords associated with each advertisement, the title of the advertisement, a brief description of the advertisement, marketing literature associated with the advertisement that describes the advertised item or the audience to which it is marketed, and information from websites that may be referenced by the advertisement. The TSV for a web page may be generated from information including, but not limited to, some or all of the actual text appearing on the web page, meta-text fields associated with the web page such as title, keywords, and description, and text from other web pages that link to, or are linked from, the web page.

For faster operation, TSVs for advertisements can be generated offline and updated when advertisements are modified, added, or removed. Optionally, however, TSVs can also be generated at advertisement placement time. Similarly, TSVs for web pages or other data sets can be generated either offline or on the fly.

According to an embodiment, an exemplary system disclosed herein analyzes the various sections of a web page or displayed document and, based on a final match list of background items, automatically links each of one or more sections to a set of background items, such as encyclopedic entries from Wikipedia (http://www.wikipedia.org).

Those skilled in the art will appreciate that the methods and systems disclosed herein are applicable to a variety of purposes, such as associating one or more advertisements with one or more web pages or documents or vice versa, retrieving relevant documents based on a user's search query, and finding background information for different portions of a data set. It is also understood that a data set, as used herein, may include only a single type of data set, such as web pages or documents, or a collection of different types of data sets, such as a combination of emails with web pages, documents, and broadcast data.

Another embodiment according to the present disclosure uses a refined representation, referred to as a "tagged key", to represent or index data sets such as advertisements 12 and web pages 11. A tagged key associates a keyword found in a data set with one or more specific semantic categories applicable to the data set. For example, the term "bank" can have many different meanings, but once it is tagged with a semantic category such as Financial Institution, it no longer matches a "bank" tagged with a semantic category such as Geological Structure.

When a data set such as a web page 11 or an advertisement 12 is analyzed, candidate keywords deemed representative of the web page or advertisement are selected from each advertisement or web page 11 by keyword selector 115 or 106, as already discussed with respect to FIG. 3. In one embodiment, the candidate keywords may be selected based on the frequency with which each keyword appears in the particular data set or document. An exemplary system according to the present disclosure accesses a semantic dictionary containing information on the relationships between the predetermined semantic categories and those candidate keywords. For example, for a data set with N candidate keywords and M predetermined categories, M × N keyword-category pairs (potential tagged keys) are possible. A filter can be used to remove categories that are not closely related to a keyword. A threshold specifying the minimum required relevance may be used to identify the categories that are sufficiently related to a keyword. One exemplary way of selecting a category for a keyword is simply to look it up in the semantic dictionary discussed above, which contains information specifying how strongly a particular term selects for a given semantic category. In one embodiment, the category most strongly selected for by the keyword would be the primary candidate for tagging.
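Frequency-based selection of candidate keywords can be sketched with the standard library's `collections.Counter`. The tokenization, the optional stopword set, and the `top_n` cutoff are assumptions of this sketch; the text only states that frequency may be used as the selection criterion.

```python
from collections import Counter


def candidate_keywords(text, top_n=5, stopwords=frozenset()):
    """Return the `top_n` most frequent tokens of a data set's text."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    return [word for word, _ in Counter(tokens).most_common(top_n)]


top_two = candidate_keywords("bank loan bank rate bank loan", top_n=2)
```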

For example, suppose that a document contains two keywords, K1 and K2. K1 and K2 would then be looked up in the semantic dictionary to see which categories, if any, are linked to which keyword. If a keyword is related to more than one category, such as categories C1, C2, C3, and C4, there are several options: (1) choose the category with the strongest link to the keyword; (2) choose all categories whose link exceeds a minimum threshold; or (3) choose all categories regardless of link strength. The result is a list of tagged keys, that is, paired keywords and categories such as K1+C1, K2+C2, and K2+C4, that represents the data set. Each tagged key can be regarded as a semantic vector corresponding to its keyword, and the semantic vectors of the candidate keywords can be combined, for example by vector addition, to form a semantic vector representation of the data set. The semantic vector representation may be used in the same ways described elsewhere in this disclosure.
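The three options can be expressed as modes of a single lookup function. In this sketch the dictionary layout (keyword to category to strength), the mode names, and the strength values are all hypothetical; only the three selection rules come from the text.

```python
def tagged_keys(keywords, dictionary, mode="strongest", threshold=0.0):
    """Pair each keyword with categories from a semantic dictionary.

    dictionary: {keyword: {category: strength}}, a hypothetical layout.
    mode: "strongest" (option 1), "threshold" (option 2) or "all" (option 3).
    """
    keys = []
    for kw in keywords:
        strengths = dictionary.get(kw, {})
        if not strengths:
            continue   # keyword has no linked category
        if mode == "strongest":
            chosen = [max(strengths, key=strengths.get)]
        elif mode == "threshold":
            chosen = [c for c, s in strengths.items() if s > threshold]
        elif mode == "all":
            chosen = list(strengths)
        else:
            raise ValueError(f"unknown mode: {mode}")
        keys.extend((kw, c) for c in chosen)
    return keys


# Invented strengths for the two keywords of the example.
links = {"K1": {"C1": 0.9, "C3": 0.1}, "K2": {"C2": 0.6, "C4": 0.5}}
strongest = tagged_keys(["K1", "K2"], links)
above = tagged_keys(["K1", "K2"], links, mode="threshold", threshold=0.4)
```

With these invented strengths, the threshold mode yields the pairs K1+C1, K2+C2, and K2+C4, the same shape of result as the list given in the example above.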

FIG. 7 is a block diagram illustrating a computer system 100 on which an exemplary system of the present disclosure may be implemented. Computer system 100 includes a bus 702 or other communication mechanism for communicating information, and a processor 704 coupled to bus 702 for processing information. Computer system 100 also includes a main memory 706, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 702 for storing information and instructions to be executed by processor 704. Main memory 706 may also be used for storing temporary variables or other intermediate information during execution of instructions by processor 704. Computer system 100 further includes a read only memory (ROM) 708 or other static storage device coupled to bus 702 for storing static information and instructions for processor 704. A storage device 710, such as a magnetic disk or optical disk, is provided and coupled to bus 702 for storing information and instructions.

Computer system 100 may be coupled via bus 702 to a display 712, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 714, including alphanumeric and other keys, is coupled to bus 702 for communicating information and command selections to processor 704. Another type of user input device is cursor control 716, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to processor 704 and for controlling cursor movement on display 712. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), which allows the device to specify positions in a plane.

According to one embodiment of the present disclosure, construction of TSVs and semantic operations are provided by computer system 100 in response to processor 704 executing one or more sequences of one or more instructions contained in main memory 706 or storage device 710, or received from network link 120. Such instructions may be read into main memory 706 from another computer-readable medium, such as storage device 710. Execution of the sequences of instructions contained in main memory 706 causes processor 704 to perform the process steps described herein. One or more processors in a multi-processing arrangement may also be employed to execute the sequences of instructions contained in main memory 706. In alternative embodiments, hard-wired circuitry may be used in place of, or in combination with, software instructions to implement the disclosure. Thus, embodiments of the disclosure are not limited to any specific combination of hardware circuitry and software.

The term "computer-readable medium" as used herein refers to any medium that participates in providing instructions to processor 704 for execution. Such a medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as storage device 710. Volatile media include dynamic memory, such as main memory 706. Transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up bus 702. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, a DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other medium from which a computer can read.

Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to processor 704 for execution. For example, the instructions may initially reside on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send them over a telephone line using a modem. A modem local to computer system 100 can receive the data on the telephone line and use an infrared transmitter to convert the data to an infrared signal. An infrared detector coupled to bus 702 can receive the data carried in the infrared signal and place the data on bus 702. Bus 702 carries the data to main memory 706, from which processor 704 retrieves and executes the instructions. The instructions received by main memory 706 may optionally be stored on storage device 710 either before or after execution by processor 704.

Computer system 100 also includes a communication interface 718 coupled to bus 702. Communication interface 718 provides a two-way data communication coupling to a network link 120 that is connected to a local network 722. For example, communication interface 718 may be an integrated services digital network (ISDN) card or a modem that provides a data communication connection to a corresponding type of telephone line. As another example, communication interface 718 may be a local area network (LAN) card that provides a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 718 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.

Network link 120 typically provides data communication through one or more networks to other data devices. For example, network link 120 may provide a connection through local network 722 to a host computer 724 or to data equipment operated by an Internet Service Provider (ISP) 726. ISP 726 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the "Internet" 728. Local network 722 and Internet 728 both use electrical, electromagnetic, or optical signals that carry digital data streams. The signals through the various networks, and the signals on network link 120 and through communication interface 718, which carry the digital data to and from computer system 100, are exemplary forms of carrier waves transporting the information.

Computer system 100 can send messages and receive data, including program code, through the network(s), network link 120, and communication interface 718. In the Internet example, a server 130 might transmit requested code for an application program through Internet 728, ISP 726, local network 722, and communication interface 718. In accordance with the disclosure, one such downloaded application provides for constructing TSVs and performing various semantic operations as described herein. The received code may be executed by processor 704 as it is received, and/or stored in storage device 710 or other non-volatile storage for later execution. In this manner, computer system 100 may obtain application code in the form of a carrier wave.

In the preceding description, numerous specific details are set forth, such as specific materials, structures, and processes, in order to provide a thorough understanding of the present disclosure. However, as one of ordinary skill in the art will recognize, the present disclosure can be practiced without resorting to the details specifically set forth. In other instances, well-known processing structures have not been described in detail so as not to obscure the present disclosure unnecessarily.

Only the exemplary embodiments of the disclosure and a few examples of their versatility are shown and described in this disclosure. It is to be understood that the disclosure is capable of use in various other combinations and environments and is capable of changes or modifications within the scope of the inventive concepts expressed herein.

Claims

A machine-implemented method for controlling a data processing system to relate at least one data set from a group of data sets to a target data set, wherein each data set or the target data set includes at least one keyword, the method comprising the machine-implemented steps of:
accessing a semantic vector representing the target data set and a respective semantic vector representing each respective data set in the group, wherein:
each semantic vector representing a respective data set in the group includes collective information about the relationship between each of the at least one keyword in the respective data set and the predetermined categories to which each of the at least one keyword in the respective data set can relate;
the semantic vector representing the target data set includes collective information about the relationship between each of the at least one keyword in the target data set and the predetermined categories to which each of the at least one keyword in the target data set can relate; and
the semantic vector representing the target data set or each respective data set in the group has a dimension equal to the number of the predetermined categories;
for each data set in the group, determining a first similarity between the target data set and that data set by comparing the semantic vector associated with the target data set with the semantic vector associated with that data set;
accessing a keyword semantic representation of the target data set and a keyword semantic representation of each respective data set in the group, wherein:
the keyword semantic representation of the target data set or of each respective data set in the group includes information representing representative keywords of the target data set or of the respective data set in the group; and
the keyword semantic representation of the target data set or of each respective data set in the group is constructed differently from the semantic vector of the target data set or of the respective data set in the group;
for each data set in the group, determining a second similarity between the target data set and that data set by comparing the keyword semantic representation of the target data set with the keyword semantic representation of that data set;
selecting at least one of the data sets in the group according to the first similarity between the target data set and each data set in the group and the second similarity between the target data set and each data set in the group; and
associating the at least one selected data set in the group with the target data set.
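
As a non-authoritative illustration of the two similarity determinations recited in the method above, the following minimal Python sketch computes a first similarity from semantic vectors and a second similarity from representative keywords. The claim does not prescribe particular measures; cosine similarity and Jaccard overlap are assumptions chosen for brevity, and the names `cosine`, `jaccard`, `similarities`, and the `"vector"` / `"keywords"` dictionary fields are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard overlap of two keyword collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarities(target, group):
    """Return {name: (first_similarity, second_similarity)} for each data set.

    Assumption: each data set is a dict with a "vector" entry (one component
    per predetermined category) and a "keywords" entry (representative keywords).
    """
    return {
        name: (
            cosine(target["vector"], data_set["vector"]),        # first similarity
            jaccard(target["keywords"], data_set["keywords"]),   # second similarity
        )
        for name, data_set in group.items()
    }
```

The two scores are kept separate here so that any of the selection strategies in the dependent claims can consume them.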

The method of claim 1, wherein at least one of the data sets in the group is an advertisement, and the target data set is a document, web page, email, RSS news feed, data stream, broadcast data, or information about a user, or is part of one or more documents, web pages, emails, RSS news feeds, data streams, broadcast data, or information about a user.

The method of claim 1, wherein the target data set is part of a document, web page, email, RSS news feed, data stream, broadcast data, or user information.

The method of claim 1, further comprising communicating to the user the at least one selected data set, or a file associated with the selected data set, in relation to the target data set or a file associated with the target data set.

The method of claim 4, wherein the at least one selected data set is communicated to the user by displaying the at least one selected data set, reproducing an acoustic signal according to the at least one selected data set, or providing a link to the at least one selected data set.

The method of claim 1, wherein the at least one keyword comprises at least one of a word, a phrase, a character string, a pre-assigned keyword, a sub data set, meta information, and information extracted based on a link included in the respective data set.

The method of claim 1, wherein the semantic vector for each data set is pre-calculated and included in the respective data set.

The method of claim 1, wherein the semantic vector is dynamically generated.

The method of claim 1, wherein:
the semantic vector representing each respective data set in the group is constructed based on the at least one keyword of the respective data set and a known relationship between known keywords and the predetermined categories to which the known keywords can relate; and
the semantic vector representing the target data set is constructed based on the at least one keyword of the target data set and the known relationship between the known keywords and the predetermined categories to which the known keywords can relate.

The method, wherein the semantic vector associated with the respective data set is further generated based on information about at least one user or at least one data set linked to the respective data set.

The method of claim 10, wherein the information about the at least one user includes at least one of previously viewed documents, previous search requests, user preferences, and personal information.

The method of claim 1, wherein the step of selecting at least one of the data sets in the group according to the first similarity between the target data set and each data set in the group and the second similarity between the target data set and each data set in the group comprises:
designating one of the first similarity and the second similarity as a primary similarity and the other as a secondary similarity;
accessing information on a plurality of preset relatedness levels for the primary similarity;
for each data set in the group, mapping the data set to one of the preset relatedness levels according to its primary similarity;
ranking the data sets in the group according to the preset relatedness level to which each data set is mapped;
within each relatedness level, ranking the data sets in that level according to their secondary similarity; and
selecting the at least one of the data sets in the group according to a result of ranking the data sets within each relatedness level.
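
The level-based selection recited above can be sketched as follows. This is only one possible reading: the level boundaries, the use of ascending lower-bound thresholds, and the names `rank_by_levels`, `thresholds`, and `top_n` are assumptions introduced for illustration and do not come from the disclosure.

```python
import bisect


def rank_by_levels(scores, thresholds, top_n=1):
    """Select data sets by relatedness level first, secondary similarity second.

    scores:     {name: (primary_similarity, secondary_similarity)}
    thresholds: ascending lower bounds of the preset relatedness levels
                (assumed form of the "preset relatedness level information").
    """
    def level(primary):
        # Level 0 is below every threshold; higher levels are more related.
        return bisect.bisect_right(thresholds, primary)

    ranked = sorted(
        scores,
        key=lambda name: (level(scores[name][0]), scores[name][1]),
        reverse=True,  # highest level first, then highest secondary similarity
    )
    return ranked[:top_n]
```

Because the primary similarity is quantized into levels, two data sets with slightly different primary scores can fall into the same level and be ordered purely by the secondary similarity.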

The method of claim 1, wherein the step of selecting at least one of the data sets in the group according to the first similarity between the target data set and each data set in the group and the second similarity between the target data set and each data set in the group comprises:
designating one of the first similarity and the second similarity as a primary similarity and the other as a secondary similarity;
ranking the data sets in the group according to the primary similarity;
selecting at least one candidate data set from the ranked data sets according to preset criteria;
ranking the at least one candidate data set according to the secondary similarity; and
selecting the at least one of the data sets in the group according to a result of ranking the at least one candidate data set.
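
A minimal sketch of this two-stage selection follows, assuming the "preset criteria" is simply a fixed shortlist size. That criterion, and the names `two_stage_select`, `shortlist_size`, and `top_n`, are hypothetical choices for illustration.

```python
def two_stage_select(scores, shortlist_size, top_n=1):
    """Shortlist by primary similarity, then re-rank the shortlist by secondary.

    scores: {name: (primary_similarity, secondary_similarity)}
    shortlist_size: assumed preset criterion (keep the best N by primary).
    """
    # Stage 1: rank every data set by primary similarity and keep the shortlist.
    shortlist = sorted(scores, key=lambda n: scores[n][0], reverse=True)[:shortlist_size]
    # Stage 2: re-rank only the candidates by secondary similarity.
    reranked = sorted(shortlist, key=lambda n: scores[n][1], reverse=True)
    return reranked[:top_n]
```

Unlike the level-based variant, a data set with a very high secondary similarity is never selected here if its primary similarity keeps it off the shortlist.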

The method of claim 1, wherein the step of selecting at least one of the data sets in the group according to the first similarity between the target data set and each data set in the group and the second similarity between the target data set and each data set in the group comprises:
for each data set in the group, calculating a composite similarity according to a preset formula based on the respective first similarity and the respective second similarity of the data set; and
selecting the at least one of the data sets in the group according to the respective composite similarities of the data sets, based on preset criteria.
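
The composite-similarity selection can be illustrated with a weighted sum. The claim only requires "a preset formula" and "preset criteria"; the linear formula, the minimum-score criterion, and the names `composite_select`, `weight`, and `threshold` below are assumptions.

```python
def composite_select(scores, weight=0.5, threshold=0.5):
    """Combine first and second similarities and keep data sets above a threshold.

    scores: {name: (first_similarity, second_similarity)}
    weight: share given to the first similarity in the assumed linear formula.
    Returns (selected names, best first) and the composite score of every data set.
    """
    composite = {
        name: weight * first + (1.0 - weight) * second
        for name, (first, second) in scores.items()
    }
    ordered = sorted(composite, key=composite.get, reverse=True)
    selected = [name for name in ordered if composite[name] >= threshold]
    return selected, composite
```

Setting `weight` to 1.0 or 0.0 reduces the sketch to selection by the first or the second similarity alone.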

The method of claim 1, further comprising presenting the at least one of the data sets to a user simultaneously with the target data set.

The method of claim 1, further comprising presenting the at least one of the data sets to the user subsequent to presenting the target data set to the user.

The method of claim 1, wherein the at least one of the data sets or the target data set is presented to the user in an auditory form, a visual form, a video form, a tactile form, or any combination thereof.

A data processing system for associating at least one data set from a group of data sets with a target data set, wherein each data set or the target data set includes at least one keyword, the system comprising:
a data processor configured to process data; and
a data storage system configured to store instructions which, when executed by the data processor, control the data processor to perform the steps of:
accessing a semantic vector representing the target data set and a respective semantic vector representing each respective data set in the group, wherein:
each semantic vector representing a respective data set in the group includes collective information about the relationship between each of the at least one keyword in the respective data set and the predetermined categories to which each of the at least one keyword in the respective data set can relate;
the semantic vector representing the target data set includes collective information about the relationship between the at least one keyword in the target data set and the predetermined categories to which each of the at least one keyword in the target data set can relate; and
the semantic vector representing the target data set or each respective data set in the group has a dimension equal to the number of the predetermined categories;
for each data set in the group, determining a first similarity between the target data set and that data set by comparing the semantic vector associated with the target data set with the semantic vector associated with that data set;
accessing a keyword semantic representation of the target data set and a keyword semantic representation of each respective data set in the group, wherein:
the keyword semantic representation of the target data set or of each respective data set in the group includes information representing representative keywords of the target data set or of the respective data set in the group; and
the keyword semantic representation of the target data set or of each respective data set in the group is constructed differently from the semantic vector of the target data set or of the respective data set in the group;
for each data set in the group, determining a second similarity between the target data set and that data set by comparing the keyword semantic representation of the target data set with the keyword semantic representation of that data set;
selecting at least one of the data sets in the group according to the first similarity between the target data set and each data set in the group and the second similarity between the target data set and each data set in the group; and
associating the at least one selected data set in the group with the target data set.

A machine-readable medium carrying instructions which, upon execution by a data processing system, control the data processing system to perform machine-implemented steps to relate at least one data set from a group of data sets to a target data set, wherein each data set or the target data set includes at least one keyword, the steps comprising:
accessing a semantic vector representing the target data set and a respective semantic vector representing each respective data set in the group, wherein:
each semantic vector representing a respective data set in the group includes collective information about the relationship between each of the at least one keyword in the respective data set and the predetermined categories to which each of the at least one keyword in the respective data set can relate;
the semantic vector representing the target data set includes collective information about the relationship between each of the at least one keyword in the target data set and the predetermined categories to which each of the at least one keyword in the target data set can relate; and
the semantic vector representing the target data set or each respective data set in the group has a dimension equal to the number of the predetermined categories;
for each data set in the group, determining a first similarity between the target data set and that data set by comparing the semantic vector associated with the target data set with the semantic vector associated with that data set;
accessing a keyword semantic representation of the target data set and a keyword semantic representation of each respective data set in the group, wherein:
the keyword semantic representation of the target data set or of each respective data set in the group includes information representing representative keywords of the target data set or of the respective data set in the group; and
the keyword semantic representation of the target data set or of each respective data set in the group is constructed differently from the semantic vector of the target data set or of the respective data set in the group;
for each data set in the group, determining a second similarity between the target data set and that data set by comparing the keyword semantic representation of the target data set with the keyword semantic representation of that data set;
selecting at least one of the data sets in the group according to the first similarity between the target data set and each data set in the group and the second similarity between the target data set and each data set in the group; and
associating the at least one selected data set in the group with the target data set.

A machine-implemented method for controlling a data processing system to associate at least one data set from a group of data sets with a target data set, wherein each data set or the target data set includes at least one keyword, the method comprising the machine-implemented steps of:
accessing a semantic vector representing the target data set and a respective semantic vector representing each respective data set in the group, wherein:
each semantic vector representing a respective data set in the group includes collective information about the relationship between each of the at least one keyword in the respective data set and the predetermined categories to which each of the at least one keyword in the respective data set can relate;
the semantic vector representing the target data set includes collective information about the relationship between each of the at least one keyword in the target data set and the predetermined categories to which each of the at least one keyword in the target data set can relate; and
the semantic vector representing the target data set or each respective data set in the group has a dimension equal to the number of the predetermined categories;
accessing a keyword semantic representation of the target data set and a keyword semantic representation of each respective data set in the group, wherein:
the keyword semantic representation of the target data set or of each respective data set in the group includes information representing representative keywords of the target data set or of the respective data set in the group; and
the keyword semantic representation of the target data set or of each respective data set in the group is constructed differently from the semantic vector of the target data set or of the respective data set in the group;
for each data set in the group, generating a combined vector representation of the data set according to the semantic vector associated with the data set and the keyword semantic representation of the data set;
generating, for the target data set, a combined vector representation of the target data set according to the semantic vector associated with the target data set and the keyword semantic representation of the target data set;
determining a similarity between the target data set and each data set in the group by comparing the combined vector representation of the target data set with the combined vector representation of each data set in the group;
selecting at least one of the data sets in the group according to the determined similarity; and
associating the at least one selected data set in the group with the target data set.
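
One way to picture the combined vector representation recited above is to append a keyword indicator vector to the semantic vector and compare the results directly. The claim does not say how the two representations are combined; the concatenation, the shared `vocabulary`, the `keyword_weight` parameter, and the function names below are all assumptions made for this sketch.

```python
import math


def combined_vector(semantic_vector, keywords, vocabulary, keyword_weight=1.0):
    """Concatenate a semantic vector with a keyword indicator vector.

    vocabulary: ordered list of all keywords considered (assumed to be shared
    by the target data set and every data set in the group).
    """
    keyword_part = [keyword_weight if term in keywords else 0.0 for term in vocabulary]
    return list(semantic_vector) + keyword_part


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(target_vec, group_vecs):
    """Return the name of the data set whose combined vector is closest to the target."""
    return max(group_vecs, key=lambda name: cosine(target_vec, group_vecs[name]))
```

With a single combined vector per data set, one comparison replaces the separate first and second similarity computations, and `keyword_weight` controls how strongly keyword overlap influences the result.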

A machine-readable medium carrying instructions which, upon execution by a data processing system, control the data processing system to perform machine-implemented steps to relate at least one data set from a group of data sets to a target data set, wherein each data set or the target data set includes at least one keyword, the steps comprising:
accessing a semantic vector representing the target data set and a respective semantic vector representing each respective data set in the group, wherein:
each semantic vector representing a respective data set in the group includes collective information about the relationship between each of the at least one keyword in the respective data set and the predetermined categories to which each of the at least one keyword in the respective data set can relate;
the semantic vector representing the target data set includes collective information about the relationship between each of the at least one keyword in the target data set and the predetermined categories to which each of the at least one keyword in the target data set can relate; and
the semantic vector representing the target data set or each respective data set in the group has a dimension equal to the number of the predetermined categories;
accessing a keyword semantic representation of the target data set and a keyword semantic representation of each respective data set in the group, wherein:
the keyword semantic representation of the target data set or of each respective data set in the group includes information representing representative keywords of the target data set or of the respective data set in the group; and
the keyword semantic representation of the target data set or of each respective data set in the group is constructed differently from the semantic vector of the target data set or of the respective data set in the group;
for each data set in the group, generating a combined vector representation of the data set according to the semantic vector associated with the data set and the keyword semantic representation of the data set;
generating, for the target data set, a combined vector representation of the target data set according to the semantic vector associated with the target data set and the keyword semantic representation of the target data set;
determining a similarity between the target data set and each data set in the group by comparing the combined vector representation of the target data set with the combined vector representation of each data set in the group;
selecting at least one of the data sets in the group according to the determined similarity; and
associating the at least one selected data set in the group with the target data set.

A machine-implemented method for controlling a data processing system to relate at least one data set from a group of data sets to a target data set, wherein each data set or the target data set includes at least one keyword, the method comprising the machine-implemented steps of:
accessing a tagged key representation representing the target data set and a respective tagged key representation representing each respective data set in the group, wherein:
each tagged key representation representing a respective data set in the group includes collective information about the relationship between each representative keyword of the respective data set and the predetermined categories to which each representative keyword in the respective data set can relate; and
the tagged key representation representing the target data set includes collective information about the relationship between each representative keyword in the target data set and the predetermined categories to which each representative keyword in the target data set can relate;
for each data set in the group, determining a similarity between the target data set and that data set by comparing the tagged key representation associated with the target data set with the tagged key representation associated with that data set;
selecting at least one of the data sets in the group according to the determined similarity between the target data set and each data set in the group; and
associating the at least one selected data set in the group with the target data set.

A machine-readable medium carrying instructions which, upon execution by a data processing system, control the data processing system to perform machine-implemented steps to relate at least one data set from a group of data sets to a target data set, wherein each data set or the target data set includes at least one keyword, the steps comprising:
accessing a tagged key representation representing the target data set and a respective tagged key representation representing each respective data set in the group, wherein:
each tagged key representation representing a respective data set in the group includes collective information about the relationship between each representative keyword of the respective data set and the predetermined categories to which each representative keyword in the respective data set can relate; and
the tagged key representation representing the target data set includes collective information about the relationship between each representative keyword in the target data set and the predetermined categories to which each representative keyword in the target data set can relate;
for each data set in the group, determining a similarity between the target data set and that data set by comparing the tagged key representation associated with the target data set with the tagged key representation associated with that data set;
selecting at least one of the data sets in the group according to the determined similarity between the target data set and each data set in the group; and
associating the at least one selected data set in the group with the target data set.

A machine-implemented method for controlling a data processing system to generate a tagged representation of a data set that includes at least one keyword, the method comprising:
identifying a representative keyword from the at least one keyword to represent the data set;
accessing data identifying known relationships between known keywords and predetermined categories;
determining a relationship between each representative keyword and the predetermined categories by referring to the accessed data;
building a tagged key representation of the data set according to the determined relationship between each representative keyword and the predetermined categories; and
representing the data set using the constructed tagged key representation.
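
The steps of this method can be sketched as below. The disclosure's actual criteria for choosing representative keywords and its storage format for keyword-category relationships are not restated in this claim, so the frequency-based choice, the dictionary-of-dictionaries layout, and the names `tagged_key_representation`, `known_relations`, and `max_representatives` are assumptions for illustration only.

```python
from collections import Counter


def tagged_key_representation(keywords, known_relations, max_representatives=3):
    """Build a keyword -> {category: weight} mapping for one data set.

    keywords:        every keyword occurrence found in the data set.
    known_relations: {known_keyword: {category: weight}} (assumed layout of the
                     data identifying known keyword-category relationships).
    Representative keywords are assumed to be the most frequent keywords that
    have a known relationship; unknown keywords are skipped.
    """
    counts = Counter(k for k in keywords if k in known_relations)
    representatives = [k for k, _ in counts.most_common(max_representatives)]
    # Tag each representative keyword with its category relationships.
    return {k: dict(known_relations[k]) for k in representatives}
```

Each representative keyword keeps its own category weights, so two data sets can later be compared keyword by keyword as well as category by category.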

A machine-readable medium carrying instructions which, upon execution by a data processing system, control the data processing system to perform machine-implemented steps to relate at least one data set from a group of data sets to a target data set, wherein each data set or the target data set includes at least one keyword, the steps comprising:
identifying a representative keyword from the at least one keyword to represent the data set;
accessing data identifying known relationships between known keywords and predetermined categories;
determining a relationship between each representative keyword and the predetermined categories by referring to the accessed data;
building a tagged key representation of the data set according to the determined relationship between each representative keyword and the predetermined categories; and
representing the data set using the constructed tagged key representation.