JP5020274B2

JP5020274B2 - Semantic drift occurrence evaluation method and apparatus

Info

Publication number: JP5020274B2
Application number: JP2009041832A
Authority: JP
Inventors: 慎平牧本
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2009-02-25
Filing date: 2009-02-25
Publication date: 2012-09-05
Anticipated expiration: 2029-02-25
Also published as: JP2010198269A

Description

The present invention relates to a method and apparatus for evaluating the occurrence of semantic drift.

Conventionally, bootstrap algorithms have been proposed as a way of acquiring named entities such as person names, place names, and organization names, together with semantic knowledge about the relationships among them (see, for example, Non-Patent Document 1). Starting from a small number of seed instances, which are initial values belonging to a predetermined category, a bootstrap algorithm extracts instances belonging to the same category and then uses the extracted instances to increase the number of instances iteratively. A bootstrap algorithm can extract instances from general text. More specifically, when a search log is used as the source of instances, a word that belongs to the predetermined category, among the words making up the search queries in the search log, is taken as a seed instance. In each search query that contains a seed instance, the character string other than the seed instance is extracted as a pattern, and instances are then extracted using the extracted patterns. A pattern that acquires instances belonging to the predetermined category with high precision is assigned a high fitness score, and a pattern that acquires unrelated instances is assigned a low fitness score. Using the patterns in descending order of fitness yields instances with high precision. The process of acquiring new instances from the acquired instances is then repeated in the same manner as the acquisition from the seed instances. A bootstrap algorithm thus has the advantage that a large number of instances can be acquired from a small number of seed instances.

Non-Patent Document 1: 小町守 (Mamoru Komachi), 鈴木久美 (Hisami Suzuki), "Improvement of semi-supervised acquisition of semantic knowledge from search logs" (検索ログからの半教師あり意味知識獲得の改善), Transactions of the Japanese Society for Artificial Intelligence, 2008, No. 3, pp. 217-225, Internet <URL: http://www.jstage.jst.go.jp/article/tjsai/23/3/217/_pdf/-char/ja/>

However, with the method described in Non-Patent Document 1, an extracted pattern may be a generic pattern, that is, a pattern that appears in multiple categories. Because a generic pattern also co-occurs with categories other than the predetermined category, the method of Non-Patent Document 1 can acquire instances unrelated to the seed instances of the predetermined category. Once such an unrelated instance has been acquired, patterns that acquire instances closely related to it are extracted, and the acquired instances shift toward ones with little relevance to the predetermined category. This problem is known as semantic drift. Semantic drift can also occur, regardless of whether generic patterns are present, when an ambiguous instance that can belong to multiple categories is acquired.

The present invention has been proposed in view of these conventional problems, and its object is to provide an evaluation method and apparatus that make it possible to recognize that semantic drift has occurred.

The present invention provides the following solutions.

(1) A semantic drift occurrence evaluation method for evaluating the state of occurrence of semantic drift, in which the meaning of a category shifts, when instances highly relevant to a predetermined category are acquired from a search log in a method of acquiring instances included in the predetermined category by a bootstrap algorithm, the method comprising: a first extraction step of extracting new instances based on the search log; a second extraction step of extracting related keywords of the instances from a related keyword dictionary stored in advance when execution of the first extraction step is repeated, by the bootstrap algorithm, using the new instances extracted in the first extraction step; a step of generating a related keyword vector whose elements are numerical values assigned to the related keywords; and a step of evaluating the degree of semantic drift in each iteration based on the related keyword vector in each iteration.

According to the semantic drift occurrence evaluation method of (1), a first extraction process extracts new instances based on the search log, and the bootstrap algorithm repeats the first extraction process using the extracted instances. When the extraction of new instances is repeated by the bootstrap algorithm, related keywords of the instances are extracted from a related keyword dictionary stored in advance. A related keyword vector is then generated whose elements are the numerical values assigned to the extracted related keywords, and the degree of semantic drift in each iteration is evaluated based on the related keyword vector in each iteration.

According to this method, the degree of semantic drift in each iteration is evaluated from the related keyword vector of that iteration and the related keyword vector immediately preceding it, so the occurrence of semantic drift can be recognized from the evaluation result.

(2) The semantic drift occurrence evaluation method according to (1), wherein the step of evaluating the degree of semantic drift calculates a cosine similarity between the related keyword vector in each iteration and the related keyword vector immediately before that iteration, and evaluates the degree of semantic drift based on the cosine similarity.

According to the semantic drift occurrence evaluation method of (2), a cosine similarity is calculated between the related keyword vector in each iteration and the related keyword vector immediately before that iteration, and the degree of semantic drift is evaluated based on the cosine similarity. This makes it possible to measure how far the related keywords in each iteration have shifted from those of the preceding iteration.

(3) The semantic drift occurrence evaluation method according to (1), wherein the step of evaluating the degree of semantic drift calculates a cosine similarity between the related keyword vector in each iteration and the related keyword vector of the seed instances, which are the initial values used when the first extraction step extracts the new instances, and evaluates the degree of semantic drift based on the cosine similarity.

According to the semantic drift occurrence evaluation method of (3), a cosine similarity is calculated between the related keyword vector in each iteration and the related keyword vector of the seed instances, which are the initial values used when the first extraction step extracts the new instances, and the degree of semantic drift is evaluated based on the cosine similarity. This makes it possible to measure how far the related keywords of the instances extracted by the iterations have shifted from the related keywords of the seed instances.

(4) A semantic drift occurrence evaluation apparatus for evaluating the state of occurrence of semantic drift, in which the meaning of a category shifts, when instances highly relevant to a predetermined category are acquired from a search log in a method of acquiring instances included in the predetermined category by a bootstrap algorithm, the apparatus comprising: instance extraction means for extracting new instances based on the search log; iterative execution control means for repeating, by the bootstrap algorithm, execution of the instance extraction means using the new instances extracted by the instance extraction means; related keyword extraction means for extracting related keywords of the instances from a related keyword dictionary stored in advance; vector generation means for generating a related keyword vector whose elements are numerical values assigned to the related keywords; and semantic drift evaluation means for evaluating the degree of semantic drift in each of the iterations based on the related keyword vector in each of the iterations.

With this configuration, building the apparatus can be expected to provide the same effect as (1).

According to the present invention, it is possible to provide an evaluation method and apparatus that make it possible to recognize that semantic drift has occurred.

A diagram showing a configuration example of the semantic drift occurrence evaluation device 1 according to the present embodiment.
A diagram showing the search log DB 21 according to the present embodiment.
A diagram showing the related keyword dictionary DB 22 according to the present embodiment.
A diagram showing the hardware configuration of the semantic drift occurrence evaluation device 1 according to the present embodiment.
A flowchart showing the flow of processing performed by the semantic drift occurrence evaluation device 1 according to the present embodiment.
A flowchart showing the flow of instance extraction processing in the instance extraction unit 11 of the control unit 10 according to the present embodiment.
A diagram showing measurement results of the seed similarity and the difference similarity.

Embodiments of the present invention are described below with reference to the drawings.

[Overall configuration]
FIG. 1 is a diagram showing a configuration example of the semantic drift occurrence evaluation device 1 according to the present embodiment.

The semantic drift occurrence evaluation device 1 evaluates the state of occurrence of semantic drift, in which instances with low relevance to a predetermined category are acquired, when instances highly relevant to the predetermined category are acquired from a search log in a method of acquiring instances (words) included in the predetermined category by a bootstrap algorithm.

The semantic drift occurrence evaluation device 1 includes a control unit 10, a storage unit 20, a display unit 31, and an operation unit 32. The control unit 10 consists of an instance extraction unit 11, an iterative execution control unit 12, a related keyword extraction unit 13, a vector generation unit 14, and a semantic drift evaluation unit 15. The storage unit 20 stores a search log database 21 (hereinafter, "database" is abbreviated as "DB") and a related keyword dictionary DB 22.

The instance extraction unit 11 refers to the search log DB 21 (see FIG. 2, described later) and extracts new instances. More specifically, the instance extraction unit 11 extracts from the search log DB 21 the search queries that contain an instance of the specified instance set. The specified instance set is the set of seed instances when the semantic drift occurrence evaluation device 1 first starts processing, and after the instance extraction unit 11 has extracted an instance set, it is the instance set extracted by the instance extraction unit 11. From the extracted search queries, the instance extraction unit 11 extracts the words other than the instances in the instance set as patterns, and generates a pattern set made up of the extracted patterns. Based on the pattern set, it then extracts from the search log DB 21 the search queries that contain a pattern of the pattern set. From these search queries, it extracts the words other than the patterns as instances, and generates an instance set made up of the extracted instances. The pattern reliability and the instance reliability are calculated when the pattern set and the instance set are generated, respectively. The set of instances with high reliability is then extracted as the new instance set belonging to the predetermined category.
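One round of the extraction described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes two-word queries stored as pairs, the function name `co_occurring_words` and the toy queries are hypothetical, and the reliability scoring is left out, so every co-occurring word is kept.

```python
from collections import Counter

def co_occurring_words(queries, known):
    """Count the words that share a two-word query with a word in `known`.

    Queries in which both words (or neither) are in `known` are skipped,
    so only words outside `known` are returned.
    """
    found = Counter()
    for first, second in queries:
        if first in known and second not in known:
            found[second] += 1
        elif second in known and first not in known:
            found[first] += 1
    return found

# Hypothetical search log: each query is split into two words.
queries = [
    ("tokyo", "hotel"), ("osaka", "hotel"), ("kyoto", "hotel"),
    ("tokyo", "weather"), ("kyoto", "weather"), ("java", "tutorial"),
]
seeds = {"tokyo", "osaka"}

# Step 1: words that co-occur with the current instances become patterns.
patterns = co_occurring_words(queries, seeds)
# Step 2: words that co-occur with those patterns become candidate instances.
candidates = co_occurring_words(queries, set(patterns))
new_instances = set(candidates) - seeds
print(dict(patterns), new_instances)
```

The same helper serves both directions because patterns and instances play symmetric roles in a two-word query; a fuller version would rank patterns and candidates by reliability before keeping them.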

The pattern reliability and the instance reliability are described in more detail below. Let r_π(p) denote the reliability of a pattern p in the pattern set P, and let r_ι(i) denote the reliability of an instance i in the instance set I (π and ι are subscripts; the same applies hereinafter). Based on the intuition that a reliable pattern co-occurs with reliable instances, r_π(p) is calculated by

[Equation 1]

and is defined as the weighted co-occurrence between the pattern p and each instance i in the instance set I.

pmi(i, p) is the pointwise mutual information (PMI) between the instance i and the pattern p, and max_pmi is the maximum pointwise mutual information over the pattern set and the instance set. pmi(i, p) is calculated by

[Equation 2]

Here, |i, p| is the number of times the instance i and the pattern p were searched for together, that is, the number of search queries in the search log DB 21 that contain both the instance i and the pattern p. The asterisk (*) is a wildcard.
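The counts above can be turned into PMI values with a few lines of code. The sketch below assumes the Espresso-style form pmi(i, p) = log(|i, p| / (|i, *| · |*, p|)), which is consistent with the counts and the wildcard described here but is an assumption, since the equation itself is not reproduced in this text. The function name `pmi_table` and the sample pairs are hypothetical.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """Return {(instance, pattern): pmi} from (instance, pattern) co-occurrences.

    Assumed form: pmi(i, p) = log(|i, p| / (|i, *| * |*, p|)), where
    |i, *| and |*, p| are the marginal counts of the instance and pattern.
    """
    joint = Counter(pairs)
    instance_total = Counter()
    pattern_total = Counter()
    for (inst, pat), n in joint.items():
        instance_total[inst] += n
        pattern_total[pat] += n
    return {
        (inst, pat): math.log(n / (instance_total[inst] * pattern_total[pat]))
        for (inst, pat), n in joint.items()
    }

# Hypothetical co-occurrences, one entry per search query.
pairs = [("tokyo", "hotel"), ("tokyo", "hotel"),
         ("tokyo", "weather"), ("osaka", "hotel")]
table = pmi_table(pairs)
max_pmi = max(table.values())  # the maximum PMI used for scaling
```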

Like the pattern reliability, the reliability of an instance i is defined on the basis that a reliable instance co-occurs with reliable patterns, and is calculated by

[Equation 3]

In the present embodiment, as can be seen from Equations 1 and 3, r_π(p) and r_ι(i) are defined recursively. The instance extraction unit 11 calculates the pattern reliability and the instance reliability alternately, between the process of generating a pattern set and the process of generating an instance set.
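Equations 1 to 3 are not reproduced in this text. As a hedged reconstruction, the surrounding definitions (a weighted co-occurrence, scaling by the maximum PMI, and the counts |i, p| with a wildcard) match the reliability measures of Espresso-style bootstrapping, on which the method of Non-Patent Document 1 is understood to build. Under that assumption the three equations would read as follows. The averaging over |I| and |P| and the logarithm in the PMI come from that formulation and are not confirmed by this text.

```latex
% Equation 1 (assumed form): reliability of pattern p
r_\pi(p) = \frac{\displaystyle\sum_{i \in I} \frac{\mathrm{pmi}(i,p)}{\mathrm{max}_{\mathrm{pmi}}}\, r_\iota(i)}{|I|}

% Equation 2 (assumed form): pointwise mutual information
\mathrm{pmi}(i,p) = \log \frac{|i,p|}{|i,*|\,|*,p|}

% Equation 3 (assumed form): reliability of instance i
r_\iota(i) = \frac{\displaystyle\sum_{p \in P} \frac{\mathrm{pmi}(i,p)}{\mathrm{max}_{\mathrm{pmi}}}\, r_\pi(p)}{|P|}
```

Written this way, the recursion is visible directly: r_π(p) is an average over instance reliabilities and r_ι(i) is an average over pattern reliabilities, so the two are updated in alternation.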

The iterative execution control unit 12 controls, by the bootstrap algorithm, the repeated execution of the instance extraction unit 11 using the instance set newly extracted by the instance extraction unit 11. More specifically, the iterative execution control unit 12 counts the number of times the instance extraction unit 11 has run and determines whether the count has reached the number specified by the administrator of the semantic drift occurrence evaluation device 1. If the specified number has not been reached, it causes the instance extraction unit 11 to extract instances again; once the specified number has been reached, it ends instance extraction by the instance extraction unit 11. Repeating instance extraction in this way makes it possible to extract a large number of instances.
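The counting logic described above amounts to a bounded loop. The following sketch assumes the extraction step is available as a callable; `run_bootstrap`, `toy_extract`, and `toy_links` are hypothetical names, and the toy extractor merely stands in for the instance extraction unit 11.

```python
def run_bootstrap(seeds, extract_once, max_iterations):
    """Repeat extraction a fixed number of times, feeding each result back in.

    Returns the instance set after every round; index 0 holds the seeds.
    """
    history = [set(seeds)]
    current = set(seeds)
    count = 0
    while count < max_iterations:          # stop at the administrator's limit
        current = set(extract_once(current))
        history.append(current)
        count += 1
    return history

# Toy stand-in for one extraction round: follow hypothetical word links.
toy_links = {"tokyo": {"kyoto"}, "kyoto": {"nara"}, "nara": {"deer"}}

def toy_extract(instances):
    result = set(instances)
    for word in instances:
        result |= toy_links.get(word, set())
    return result

history = run_bootstrap({"tokyo"}, toy_extract, max_iterations=3)
print(history[-1])
```

Keeping the per-round history is a design choice made here so that a later evaluation step can compare consecutive rounds.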

The related keyword extraction unit 13 refers to the related keyword dictionary DB 22 (see FIG. 3, described later) and extracts the related keywords of instances. More specifically, the related keyword extraction unit 13 refers to the related keyword dictionary DB 22 and extracts one or more related keywords for each seed instance or for each instance extracted by the instance extraction unit 11. A seed instance is an initial value used by the semantic drift occurrence evaluation device 1 to acquire instances; it is chosen manually and belongs to the predetermined category.

The related keyword extraction unit 13 extracts related keywords at one of two times. For seed instances, the related keywords are extracted before the instance extraction unit 11 performs its first instance extraction. For instances extracted by the instance extraction unit 11, the related keywords are extracted immediately after the instance extraction unit 11 extracts the new instances.

The vector generation unit 14 assigns numerical values to the related keywords extracted by the related keyword extraction unit 13 and generates a related keyword vector whose elements are the assigned values. More specifically, for each instance in the instance set, the vector generation unit 14 counts the related keywords extracted by the related keyword extraction unit 13. It then assigns to each of those related keywords the value obtained by dividing 1 by that count, and generates a related keyword vector whose elements are the assigned values. Finally, it sums the keyword vectors of all instances in the instance set.

For example, suppose that the instance set A contains two instances, X and Y, that four related keywords a, b, c, and d are extracted for X, and that two related keywords a and e are extracted for Y. Each related keyword of X is then assigned 0.25, which is 1 divided by 4, giving the related keyword vector a (0.25), b (0.25), c (0.25), d (0.25). Each related keyword of Y is assigned 0.5, which is 1 divided by 2, giving a (0.5), e (0.5). The keyword vector of the instance set is the sum of the related keyword vectors of the instances X and Y, namely a (0.75), b (0.25), c (0.25), d (0.25), e (0.5).
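The worked example above can be reproduced with a short sketch. The dictionary-based sparse vector and the name `keyword_vector` are choices made for this illustration; the text does not prescribe a data structure.

```python
from collections import defaultdict

def keyword_vector(instances, related_keywords):
    """Sum per-instance keyword vectors, each keyword weighted 1 / (keyword count)."""
    vector = defaultdict(float)
    for instance in instances:
        keywords = related_keywords.get(instance, [])
        if not keywords:
            continue  # an instance with no dictionary entry contributes nothing
        weight = 1.0 / len(keywords)
        for keyword in keywords:
            vector[keyword] += weight
    return dict(vector)

# Related keywords from the example: X has four, Y has two.
related = {"X": ["a", "b", "c", "d"], "Y": ["a", "e"]}
vec_a = keyword_vector(["X", "Y"], related)
print(vec_a)  # {'a': 0.75, 'b': 0.25, 'c': 0.25, 'd': 0.25, 'e': 0.5}
```

Because each instance distributes a total weight of 1 over its keywords, an instance with many related keywords does not dominate the aggregated vector.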

The semantic drift evaluation unit 15 evaluates the degree of semantic drift in each iteration based on the related keyword vector generated in each iteration. More specifically, the semantic drift evaluation unit 15 calculates the cosine similarity between the related keyword vector generated by the vector generation unit 14 after an iteration of the instance extraction unit 11 and the related keyword vector generated by the vector generation unit 14 immediately before that iteration. Here, the cosine similarity is the cosine of the angle between the two vectors. Letting A be the related keyword vector generated in a given iteration and B be the related keyword vector immediately before that iteration, the cosine similarity sim(A, B) is calculated by

[Equation 4]

The cosine similarity calculated in this way from the keyword vectors before and after each iteration is called the difference similarity.

For example, if the keyword vector of the instance set A extracted in a certain iteration is a (0.75), b (0.25), c (0.25), d (0.25), e (0.5), and the keyword vector of the instance set B immediately before that iteration is a (0.33), b (0.33), c (0.33), then the cosine similarity sim(A, B), as the difference similarity, is calculated as 0.72 based on Equation 4.
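The figure of 0.72 can be checked numerically. The sketch assumes that Equation 4 is the standard cosine similarity, sim(A, B) = A·B / (‖A‖ ‖B‖), which is consistent with the stated result; the function name `cosine_similarity` and the sparse dictionary representation are choices made for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for an empty vector
    return dot / (norm_a * norm_b)

A = {"a": 0.75, "b": 0.25, "c": 0.25, "d": 0.25, "e": 0.5}
B = {"a": 0.33, "b": 0.33, "c": 0.33}
print(round(cosine_similarity(A, B), 2))  # 0.72
```

Here A·B = 0.33 × 1.25 = 0.4125, ‖A‖ = 1.0, and ‖B‖ ≈ 0.5716, which gives approximately 0.7217.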

The semantic drift evaluation unit 15 also calculates the cosine similarity between the related keyword vector generated by the vector generation unit 14 in each iteration of the instance extraction unit 11 and the related keyword vector generated by the vector generation unit 14 for the set of seed instances. The formula for the cosine similarity in this case is the same as Equation 4. The cosine similarity calculated in this way from the keyword vector of the seed instances and the keyword vector of a given iteration is called the seed similarity.

The difference similarity and the seed similarity each take a value between 0 and 1 inclusive. The difference similarity measures how far the related keywords in each iteration have shifted from those of the preceding iteration. Where its value drops sharply, the related keywords are considered to have shifted, that is, semantic drift, in which instances with low relevance to the predetermined category are acquired, is considered to have occurred. When the related keywords are similar, the keyword vector changes little between the iterations and the cosine similarity is close to 1. When the related keywords are not similar, the keyword vector changes greatly between the iterations and the cosine similarity is close to 0. Monitoring the cosine similarity over the repeated executions of the instance extraction unit 11 therefore makes it possible to recognize that the keyword vector has changed greatly, that is, that semantic drift has occurred.

The seed similarity measures how far the related keywords of the instances extracted by the iterations have shifted from the related keywords of the seed instances. Introducing the seed similarity makes it possible to evaluate, on the basis of the seed instances and their related keywords, the degree of semantic drift of the instance set extracted by the repeated execution of the instance extraction unit 11.
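Both measures can be computed side by side over a run, as in the sketch below. The vectors are made-up values, and the fixed `threshold` used to flag a sharp drop is an assumption of this illustration; the text only says that an extreme decrease indicates drift and does not specify a threshold.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_series(vectors):
    """vectors[0] is the seed vector; vectors[t] follows iteration t."""
    seed = vectors[0]
    seed_sims = [cosine(v, seed) for v in vectors[1:]]
    diff_sims = [cosine(vectors[t], vectors[t - 1])
                 for t in range(1, len(vectors))]
    return seed_sims, diff_sims

def first_sharp_drop(diff_sims, threshold):
    """Return the first iteration whose difference similarity is below threshold."""
    for iteration, sim in enumerate(diff_sims, start=1):
        if sim < threshold:
            return iteration
    return None

# Hypothetical keyword vectors: the second iteration moves to new keywords.
vectors = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 1.0, "c": 0.5},
    {"x": 1.0, "y": 1.0, "c": 0.5},
]
seed_sims, diff_sims = similarity_series(vectors)
print(first_sharp_drop(diff_sims, threshold=0.5))  # 2
```

In this toy run the difference similarity falls from about 0.94 to about 0.11 at the second iteration, and the seed similarity reaches 0 at the same point, so both measures point to the same iteration.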

In the present embodiment, the degree of semantic drift in each iteration is evaluated by calculating the cosine similarity, but the invention is not limited to this. For example, the degree of semantic drift in each iteration may be evaluated by calculating the Euclidean distance or the Kullback-Leibler divergence. In that case, the vectors generated by the vector generation unit 14 must be normalized. Here, normalizing a vector means dividing each generated vector by its number of elements.
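The Euclidean alternative can be sketched as follows. Normalization follows the definition given above (division by the number of elements), and this sketch interprets "number of elements" as the number of keywords present in the sparse vector, which is an assumption. The function names are hypothetical. The Kullback-Leibler divergence is not shown, because keywords missing from one of the vectors would need a smoothing scheme that the text does not describe.

```python
import math

def normalize(vector):
    """Divide every component by the number of elements in the vector."""
    count = len(vector)
    if count == 0:
        return {}
    return {key: value / count for key, value in vector.items()}

def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

A = {"a": 0.75, "b": 0.25, "c": 0.25, "d": 0.25, "e": 0.5}
B = {"a": 0.33, "b": 0.33, "c": 0.33}
distance = euclidean_distance(normalize(A), normalize(B))
```

Unlike the cosine similarity, a distance grows as the vectors diverge, so a sharp increase rather than a sharp decrease would be read as a sign of drift.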

FIG. 2 is a diagram showing the search log DB 21 according to the present embodiment. The search log DB 21 stores the search queries contained in a search log extracted from a given search engine. The search log DB 21 includes an "instance 1" field, which stores the first word (instance) of a search query, and an "instance 2" field, which stores the second word (instance) of the search query. That is, in the present embodiment, the search log DB 21 stores each two-word search query split into its individual words (instances). The search log DB 21 is referred to when the instance extraction unit 11 extracts instances.

In the present embodiment, the search log DB 21 stores only two-word search queries, and new instances are extracted based on these queries, but the invention is not limited to this. For example, the search log DB 21 may store search queries containing three or more words, or words obtained by morphological analysis, and new instances may be extracted from those words. This increases the amount of source data, so that a greater variety of patterns can be extracted.

FIG. 3 is a diagram showing the related keyword dictionary DB 22 according to the present embodiment. The related keyword dictionary DB 22 stores instances in association with the related keywords assigned to them, and holds instances and related keywords extracted in advance from a given dictionary site. The related keyword dictionary DB 22 includes an "instance" field, which indicates an instance, and a "related keyword" field, which indicates the related keywords assigned to that instance. The related keyword dictionary DB 22 is referred to when the related keyword extraction unit 13 extracts the related keywords of seed instances or of newly extracted instances.

In the present embodiment, the instances and related keywords are extracted in advance from a predetermined dictionary site; however, the invention is not limited to this. For example, keywords obtained from snippets included in search results may be used. Here, a snippet is the short description of a Web page shown on a search engine's results page, and it is likely to contain keywords related to the search query. That is, snippets may be analyzed to extract the related keywords of a search query (instance), and the search query and the extracted related keywords may be stored in the related keyword dictionary DB 22.

The display unit 31 provides displays relating to the functions of the semantic drift occurrence evaluation device 1 and visually presents the information output by each function to the person operating the device. The operation unit 32 accepts direct input from the person operating the semantic drift occurrence evaluation device 1.

[Hardware configuration of the semantic drift occurrence evaluation device]
FIG. 4 is a diagram showing the hardware configuration of the semantic drift occurrence evaluation device 1 according to the present embodiment. The semantic drift occurrence evaluation device 1 in which the present invention is implemented may be a standard one; an example configuration is described below.

The semantic drift occurrence evaluation device 1 comprises a CPU (Central Processing Unit) 1010 constituting the control unit 10 (in a multiprocessor configuration, additional CPUs such as a CPU 1012 may be added), a bus line 1005, a communication I/F (interface) 1040, a main memory 1050, a BIOS (Basic Input Output System) 1060, a display device 1022, an I/O controller 1070, an input device 1100 such as a keyboard and mouse, a hard disk 1074, an optical disk drive 1076, and a semiconductor memory 1078. The hard disk 1074, the optical disk drive 1076, and the semiconductor memory 1078 are collectively referred to as the storage unit 20.

The control unit 10 comprehensively controls the various functions of the semantic drift occurrence evaluation device 1. By reading and executing the programs stored on the hard disk 1074 as appropriate, it cooperates with the hardware described above to realize the functions according to the present invention.

The communication I/F 1040 is a network adapter used when the semantic drift occurrence evaluation device 1 exchanges information with other servers and the like via a communication network. The communication I/F 1040 may include a modem, a cable modem, and an Ethernet (registered trademark) adapter.

The main memory 1050 temporarily stores data generated when the CPU 1010 executes programs. The BIOS 1060 stores the boot program executed by the CPU 1010 when the semantic drift occurrence evaluation device 1 starts up, programs that depend on the hardware of the device, and the like.

The display device 1022 includes a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and functions as the display unit 31.

The storage unit 20, consisting of storage devices such as the hard disk 1074, the optical disk drive 1076, and the semiconductor memory 1078, can be connected to the I/O controller 1070.

The input device 1100 accepts input from the administrator of the semantic drift occurrence evaluation device 1 and functions as the operation unit 32.

The hard disk 1074 stores the programs that cause this hardware to function as the semantic drift occurrence evaluation device 1, the programs that execute the functions of the present invention, and the DBs described above. The semantic drift occurrence evaluation device 1 can also use a separately provided external hard disk (not shown) as an external storage device.

As the optical disk drive 1076, for example, a DVD-ROM drive, a CD-ROM drive, a DVD-RAM drive, or a Blu-ray Disc (registered trademark) drive can be used. In that case, an optical disk 1077 compatible with the optical disk drive 1076 is used. Programs or data can also be read from the optical disk 1077 by the optical disk drive 1076 and supplied to the main memory 1050 or the hard disk 1074 via the I/O controller 1070.

The term "computer" as used in the present invention refers to an information processing device equipped with a storage device, a control unit, and the like. As described above, the semantic drift occurrence evaluation device 1 is an information processing device equipped with the control unit 10, the storage unit 20, and so on, and this information processing device falls within the concept of a computer in the present invention.

There is no limit on the number of hardware units making up the semantic drift occurrence evaluation device 1; it may consist of one or more units as needed. When it consists of multiple units, they may be connected via a communication network. For example, each function may be placed on a separate server (device), and the functions of the present embodiment may be realized by having the servers cooperate through the exchange of signals.

[Processing flow]
FIG. 5 is a flowchart showing the flow of processing performed by the semantic drift occurrence evaluation device 1 according to the present embodiment.

In step S1, the control unit 10 (related keyword extraction unit 13) refers to the related keyword dictionary DB 22 and extracts the related keywords of the seed instances.

In step S2, the control unit 10 (vector generation unit 14) assigns numerical values to the related keywords extracted in step S1 and generates a related keyword vector. More specifically, for each instance in the instance set, the vector generation unit 14 counts the related keywords extracted for that instance by the related keyword extraction unit 13. It then assigns 1 divided by that count to each of the instance's related keywords as its value in the related keyword vector. Finally, it aggregates the keyword vectors of all instances included in the set of seed instances.
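The weighting in step S2 can be illustrated with a minimal sketch. It assumes that "aggregating" means summing the per-instance weights for each keyword; the function name `related_keyword_vector`, the dictionary layout, and the sample entries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def related_keyword_vector(
    instances: Iterable[str],
    keyword_dict: Dict[str, List[str]],
) -> Dict[str, float]:
    """Build a sparse related keyword vector for a set of instances.

    Each instance with n related keywords contributes 1/n to each of them;
    contributions are summed over all instances (an assumed reading of
    "aggregate").
    """
    vector: Dict[str, float] = defaultdict(float)
    for instance in instances:
        keywords = keyword_dict.get(instance, [])
        if not keywords:
            continue  # instance has no dictionary entry: contributes nothing
        weight = 1.0 / len(keywords)
        for keyword in keywords:
            vector[keyword] += weight
    return dict(vector)


# Hypothetical dictionary entries, for illustration only.
example_dict = {
    "player_a": ["baseball", "pitcher"],
    "player_b": ["baseball", "catcher", "coach", "team_x"],
}
example_vector = related_keyword_vector(["player_a", "player_b"], example_dict)
```

With these inputs, "baseball" receives 1/2 from the first instance and 1/4 from the second, so keywords shared by many instances carry the most weight in the aggregated vector.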

In step S3, the control unit 10 (instance extraction unit 11) refers to the search log DB 21 and extracts new instances. The instance extraction is described in detail with reference to FIG. 6.

In step S4, the control unit 10 (related keyword extraction unit 13) refers to the related keyword dictionary DB 22 and extracts the related keywords of the new instances extracted in step S3.

In step S5, the control unit 10 (vector generation unit 14) assigns numerical values to the related keywords extracted in step S4 and generates a related keyword vector.

In step S6, the control unit 10 (semantic drift evaluation unit 15) calculates the cosine similarity based on the related keyword vectors generated in each iteration and evaluates the degree of semantic drift in each iteration.
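The cosine similarity used in step S6 is the inner product of two vectors divided by the product of their norms. A minimal sketch over sparse keyword vectors follows; representing a vector as a dict from keyword to value, and returning 0.0 for an empty vector, are assumptions made here.

```python
import math
from typing import Dict


def cosine_similarity(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors keyed by related keyword."""
    dot = sum(value * v.get(keyword, 0.0) for keyword, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention when a vector has no keywords
    return dot / (norm_u * norm_v)
```

Following the wording of the claims, the same function would give the difference similarity when applied to the vectors of consecutive iterations, and the seed similarity when applied to the current vector and the seed vector.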

In step S7, the control unit 10 (iterative execution control unit 12) counts the number of times the instance extraction unit 11 has run. This count is reset to 0 when the semantic drift occurrence evaluation device 1 starts processing.

In step S8, the control unit 10 (iterative execution control unit 12) determines whether to continue processing. More specifically, it determines whether the number of runs of the instance extraction unit 11 counted in step S7 has reached the number specified by the administrator of the semantic drift occurrence evaluation device 1. If the result is YES, processing ends; if NO, processing returns to step S3.
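The overall flow of steps S1 to S8 can be sketched as a loop. The sketch is not the embodiment's implementation: the three callables (`extract_instances`, `build_vector`, `similarity`) stand in for the instance extraction unit 11, the related keyword extraction and vector generation units 13 and 14, and the cosine similarity, and their signatures are assumptions.

```python
from typing import Callable, Dict, List, Set, Tuple

Vector = Dict[str, float]


def run_evaluation(
    seeds: Set[str],
    extract_instances: Callable[[Set[str]], Set[str]],
    build_vector: Callable[[Set[str]], Vector],
    similarity: Callable[[Vector, Vector], float],
    max_iterations: int,
) -> List[Tuple[float, float]]:
    """Return (difference similarity, seed similarity) for each iteration."""
    seed_vector = build_vector(seeds)            # S1-S2: seed keyword vector
    current = set(seeds)
    previous_vector = seed_vector
    results: List[Tuple[float, float]] = []
    count = 0                                    # reset when processing starts
    while True:
        current = extract_instances(current)     # S3: extract new instances
        vector = build_vector(current)           # S4-S5: their keyword vector
        results.append((                         # S6: evaluate drift
            similarity(vector, previous_vector),
            similarity(vector, seed_vector),
        ))
        previous_vector = vector
        count += 1                               # S7: count extraction runs
        if count >= max_iterations:              # S8: stop at specified count
            break
    return results
```

As in the flowchart, the termination check comes after the extraction, so the loop body runs at least once even if `max_iterations` is 0 or 1.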

FIG. 6 is a flowchart showing the flow of the instance extraction processing performed by the instance extraction unit 11 of the control unit 10 according to the present embodiment.

In step S31, the control unit 10 (instance extraction unit 11) refers to the search log DB 21 and extracts the search queries that contain an instance belonging to the specified instance set. In step S32, the control unit 10 (instance extraction unit 11) extracts as patterns the words in the search queries extracted in step S31 that are not instances of the specified instance set, and generates a pattern set composed of the extracted patterns.
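For the two-word search log of the present embodiment, steps S31 and S32 can be sketched as follows. The function name `extract_patterns`, the tuple representation of a log row, and the sample rows are assumptions; the position of the pattern word within the query is ignored here, which is also an assumption.

```python
from typing import Iterable, Set, Tuple


def extract_patterns(
    search_log: Iterable[Tuple[str, str]],
    instances: Set[str],
) -> Set[str]:
    """Collect the non-instance word of every query containing an instance."""
    patterns: Set[str] = set()
    for word1, word2 in search_log:
        # S31: keep only queries that contain an instance of the set.
        # S32: the other word, if it is not itself an instance, is a pattern.
        if word1 in instances and word2 not in instances:
            patterns.add(word2)
        elif word2 in instances and word1 not in instances:
            patterns.add(word1)
    return patterns


# Hypothetical log rows and instance set, for illustration only.
example_log = [("player_a", "stats"), ("stats", "player_b"), ("weather", "tokyo")]
example_patterns = extract_patterns(example_log, {"player_a", "player_b"})
```

Steps S34 and S35 apply the same idea in the opposite direction, starting from the pattern set and collecting the non-pattern words as candidate instances.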

In step S33, the control unit 10 (instance extraction unit 11) calculates the reliability of every pattern included in the pattern set generated in step S32, in accordance with Equation 1. In step S34, the control unit 10 (instance extraction unit 11) refers to the search log DB 21 and extracts the search queries that contain a pattern included in the pattern set generated in step S32.

In step S35, the control unit 10 (instance extraction unit 11) extracts as instances the words in the search queries extracted in step S34 that are not patterns of the pattern set generated in step S32, and generates an instance set composed of the extracted instances. In step S36, the control unit 10 (instance extraction unit 11) calculates the reliability of every instance included in the instance set generated in step S35, in accordance with Equation 3.

In step S37, the control unit 10 (instance extraction unit 11) extracts, based on the reliabilities calculated in step S36, the instances with high reliability as the instance set, and ends the instance extraction processing.
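The selection in step S37 can be sketched as a top-k cut over reliability scores that have already been calculated. The reliability formula itself is not reproduced here; the function name `select_top_instances`, the alphabetical tie-breaking, and the sample scores are assumptions.

```python
from typing import Dict, List


def select_top_instances(reliability: Dict[str, float], k: int) -> List[str]:
    """Return the k instances with the highest reliability.

    Ties are broken alphabetically so the result is deterministic
    (an assumption; the embodiment does not specify tie-breaking).
    """
    ranked = sorted(reliability.items(), key=lambda item: (-item[1], item[0]))
    return [instance for instance, _ in ranked[:k]]


# Hypothetical scores, for illustration only.
example_scores = {"a": 0.9, "b": 0.2, "c": 0.9, "d": 0.5}
example_top = select_top_instances(example_scores, 3)
```

The cutoff `k` corresponds to however many high-reliability instances are carried into the next iteration.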

[Experimental results]
Next, experimental results are presented for the semantic drift occurrence evaluation device 1 according to the present embodiment, in which the degree of semantic drift was evaluated at each iteration while the iterative execution control unit 12 repeatedly executed the generation of instances. For the experiment, the data stored in the search log DB 21 consisted of the entries in the Yahoo! (registered trademark) Search log for August 2008 that were composed of two words separated by a whitespace character. The data stored in the related keyword dictionary DB 22 came from the July 24, 2008 dump of Wikipedia (registered trademark).

Of the instances included in the instance set generated by the instance extraction unit 11, those ranked in the top 500 by reliability were taken as the instances newly extracted by the instance extraction unit 11. All patterns included in the pattern set were used in calculating the reliability of the instances. The instance extraction processing was repeated 50 times. The predetermined category was "baseball player", and the names of five baseball players were used as the seed instances.

FIG. 7 is a diagram showing the experimental results for the seed similarity and the difference similarity. The vertical axis indicates the difference similarity and the seed similarity, and the horizontal axis indicates the number of iterations.

As shown in FIG. 7, the difference similarity and seed similarity calculated by the semantic drift evaluation unit 15 after the fifth iteration of the instance extraction unit 11 are considerably lower than those calculated before the fifth iteration. A close inspection of the 236 instances that were swapped in during the fifth iteration showed that the instances newly extracted by the instance extraction unit 11 included no baseball players; in other words, the occurrence of semantic drift was confirmed. Therefore, the occurrence of semantic drift can be recognized by having the semantic drift evaluation unit 15 of the present embodiment calculate the difference similarity and evaluate how much it changes.

In addition, because the seed similarity measures how far the related keywords of the instances extracted in an iteration have shifted from the related keywords of the seed instances, it can be confirmed that the related keywords of the instances extracted in the fifth iteration shifted substantially from those of the seed instances. Therefore, the occurrence of semantic drift can also be recognized by having the semantic drift evaluation unit 15 calculate the seed similarity and evaluate how much it changes.
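One possible way to evaluate "how much the similarity changes" is to flag iterations at which the value falls sharply from the preceding iteration. The sketch below is only one such reading: the threshold `min_drop`, the function name `find_sharp_drops`, and the sample values are hypothetical and are not taken from FIG. 7.

```python
from typing import List, Sequence


def find_sharp_drops(similarities: Sequence[float], min_drop: float) -> List[int]:
    """Return 1-based iteration numbers whose similarity fell by at least
    min_drop relative to the previous iteration."""
    return [
        i + 1  # similarities[0] belongs to iteration 1
        for i in range(1, len(similarities))
        if similarities[i - 1] - similarities[i] >= min_drop
    ]


# Made-up similarity values, for illustration only.
example_series = [0.9, 0.88, 0.87, 0.86, 0.4, 0.38]
example_drops = find_sharp_drops(example_series, 0.3)
```

A fixed threshold is the simplest choice; a relative drop or a comparison against a running average would be alternative designs, and none of them is prescribed by the embodiment.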

An embodiment of the present invention has been described above; however, the present invention is not limited to this embodiment, and modifications, improvements, and the like within a scope that achieves the object of the present invention are included in the present invention.

DESCRIPTION OF SYMBOLS
1 Semantic drift occurrence evaluation device
10 Control unit
11 Instance extraction unit
12 Iterative execution control unit
13 Related keyword extraction unit
14 Vector generation unit
15 Semantic drift evaluation unit
21 Search log DB
22 Related keyword dictionary DB
31 Display unit
32 Operation unit

Claims

A semantic drift occurrence evaluation method for evaluating, in a method of acquiring instances belonging to a predetermined category by a bootstrap algorithm, the occurrence of semantic drift, in which the meaning of the category changes, when instances highly relevant to the predetermined category are acquired using a search log, the method comprising:
a first extraction step of extracting new instances based on the search log;
when execution of the first extraction step is repeated by the bootstrap algorithm using the new instances extracted in the first extraction step,
a second extraction step of extracting related keywords of the instances from a related keyword dictionary stored in advance;
a step of generating a related keyword vector whose elements are numerical values assigned to the related keywords; and
a step of evaluating the degree of semantic drift in each iteration based on the related keyword vector in each iteration.

The semantic drift occurrence evaluation method according to claim 1, wherein the step of evaluating the degree of semantic drift includes calculating a cosine similarity between the related keyword vector in each iteration and the related keyword vector immediately before that iteration, and evaluating the degree of semantic drift based on the cosine similarity.

The semantic drift occurrence evaluation method according to claim 1, wherein the step of evaluating the degree of semantic drift includes calculating a cosine similarity between the related keyword vector in each iteration and the related keyword vector of the seed instances, which are the initial values used when the new instances are extracted in the first extraction step, and evaluating the degree of semantic drift based on the cosine similarity.

A semantic drift occurrence evaluation device for evaluating, in a method of acquiring instances belonging to a predetermined category by a bootstrap algorithm, the occurrence of semantic drift, in which the meaning of the category changes, when instances highly relevant to the predetermined category are acquired using a search log, the device comprising:
instance extraction means for extracting new instances based on the search log;
iterative execution control means for repeating, by the bootstrap algorithm, execution of the instance extraction means using the new instances extracted by the instance extraction means;
related keyword extraction means for extracting related keywords of the instances from a related keyword dictionary stored in advance;
vector generation means for generating a related keyword vector whose elements are numerical values assigned to the related keywords; and
semantic drift evaluation means for evaluating the degree of semantic drift in each iteration based on the related keyword vector in each iteration.