JP2013161136A

JP2013161136A - Teacher data generation device, method and program

Info

Publication number: JP2013161136A
Application number: JP2012020229A
Authority: JP
Inventors: Sumio Fujita; 澄男藤田
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2012-02-01
Filing date: 2012-02-01
Publication date: 2013-08-19
Anticipated expiration: 2032-02-01
Also published as: JP5394512B2

Abstract

PROBLEM TO BE SOLVED: To provide a teacher data generation device, method, and program that can efficiently generate teacher data.
SOLUTION: A teacher data generation server 3 includes: a teacher data DB 31 that stores teacher data; a query receiving unit 32 that receives a first query and a second query different from the first query; a category DB 33 that stores a category in association with information indicating a parent-child relationship between that category and other categories; a relevance calculation unit 34 that calculates, on the basis of the category DB 33, a degree of association between a first category corresponding to the first query received by the query receiving unit 32 and a second category corresponding to the second query; a feature information acquisition unit 35 that acquires feature information representing the respective properties of the first query and the second query; and a storage control unit 36 that stores the first query, the second query, the degree of association, and the feature information in association with one another as teacher data in the teacher data DB 31.

Description

The present invention relates to a teacher data generation device, method, and program for generating teacher data used in machine learning.

Conventionally, when a computer classifies a large amount of data, classification using machine learning is performed. In the machine learning technique called supervised learning, a computer generates, for sample data, teacher data in which feature information (information representing the characteristics of the data) and the classification result are defined in advance. The computer learns a classification pattern from this teacher data and then classifies unclassified data on the basis of what it has learned (see, for example, Patent Document 1).

Patent Document 1: JP 2005-181928 A

In the supervised learning disclosed in Patent Document 1, teacher data is generated from sample data manually. Because this guarantees the quality of the teacher data, the supervised learning disclosed in Patent Document 1 classifies data with high accuracy.

However, because the supervised learning disclosed in Patent Document 1 generates teacher data manually, generating the teacher data takes time. One conceivable remedy is to reduce the amount of teacher data, but doing so lowers the accuracy of the machine learning.

An object of the present invention is to provide a teacher data generation device, method, and program capable of efficiently generating teacher data.

(1) A teacher data generation device that generates teacher data used for machine learning, comprising: query acquisition means for acquiring a first query and a second query different from the first query; relevance calculation means for calculating, based on category information storage means that stores a category in association with information indicating a parent-child relationship between the category and other categories, a degree of association between a first category corresponding to the first query acquired by the query acquisition means and a second category corresponding to the second query acquired by the query acquisition means; feature information acquisition means for acquiring, as feature information, at least one of information representing the respective properties of the first query and the second query and information representing the relationship between the first query and the second query; and storage control means for associating the first query, the second query, the degree of association, and the feature information with one another and storing them as teacher data in predetermined storage means.

(2) The teacher data generation device according to (1), wherein the relevance calculation means identifies, based on the category information storage means, a path between the first category and the category that is highest-level with respect to the first category, and a path between the second category and the category that is highest-level with respect to the second category, and calculates the degree of association based on the identified paths.

(3) The teacher data generation device according to (2), wherein the relevance calculation means calculates the degree of association based on the length of the path that the identified paths share starting from the highest-level category.

(4) The teacher data generation device according to (2), wherein the relevance calculation means calculates the degree of association based on the length of a path common to the identified paths.

(5) The teacher data generation device according to (2), wherein the relevance calculation means calculates the degree of association based on both the length of the path that the identified paths share starting from the highest-level category and the length of a path common to the identified paths.

(6) A teacher data generation method in which a computer generates teacher data used for machine learning, the method comprising the following steps executed by the computer: a query acquisition step of acquiring a first query and a second query different from the first query; a relevance calculation step of calculating, based on category information storage means that stores a category in association with information indicating a parent-child relationship between the category and other categories, a degree of association between a first category corresponding to the first query acquired in the query acquisition step and a second category corresponding to the second query acquired in the query acquisition step; a feature information acquisition step of acquiring, as feature information, at least one of feature information representing the respective properties of the first query and the second query and information representing the relationship between the first query and the second query; and a storage control step of associating the first query, the second query, the degree of association, and the feature information with one another and storing them as teacher data in predetermined storage means.

(7) A teacher data generation program that causes a computer to generate teacher data used for machine learning, the program causing the computer to execute: a query acquisition step of acquiring a first query and a second query different from the first query; a relevance calculation step of calculating, based on category information storage means that stores a category in association with information indicating a parent-child relationship between the category and other categories, a degree of association between a first category corresponding to the first query acquired in the query acquisition step and a second category corresponding to the second query acquired in the query acquisition step; a feature information acquisition step of acquiring, as feature information, at least one of feature information representing the respective properties of the first query and the second query and information representing the relationship between the first query and the second query; and a storage control step of associating the first query, the second query, the degree of association, and the feature information with one another and storing them as teacher data in predetermined storage means.

According to the present invention, it is possible to provide a teacher data generation device, method, and program capable of efficiently generating teacher data.

FIG. 1 is a diagram showing the similarity determination system.
FIG. 2 is a block diagram showing the functional configuration of the query extraction server, the teacher data generation server, and the similarity determination server that constitute the similarity determination system.
FIG. 3 is a diagram showing the search log DB.
FIG. 4 is a diagram showing the teacher data DB.
FIG. 5 is a diagram showing the category DB.
FIG. 6 is a diagram showing the query DB.
FIG. 7 is a flowchart showing the flow of the teacher data generation process executed by the query extraction server and the teacher data generation server.
FIG. 8 is a flowchart showing the flow of the similarity determination process executed by the similarity determination server.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
[Outline of similarity determination system]
FIG. 1 is a diagram showing a similarity determination system 1 according to the present embodiment.
The similarity determination system 1 includes a query extraction server 2, a teacher data generation server 3 serving as the teacher data generation device, and a similarity determination server 4. The query extraction server 2, the teacher data generation server 3, and the similarity determination server 4 can communicate with one another via a communication network N such as the Internet.

In the similarity determination system 1, the query extraction server 2 extracts combinations of a first query and a second query from the click log of a search engine. For the first query and the second query extracted by the query extraction server 2, the teacher data generation server 3 calculates a degree of association based on category information and acquires, as feature information, at least one of information representing the respective properties of the first query and the second query and information representing the relationship between them. The teacher data generation server 3 then generates teacher data by associating the first query, the second query, the degree of association, and the feature information. The similarity determination server 4 performs machine learning on the teacher data generated by the teacher data generation server 3 to produce an identification model, and uses that model to determine the similarity of other query combinations.

Next, the functional configuration of each server constituting the similarity determination system 1 will be described.
FIG. 2 is a block diagram showing the functional configuration of the query extraction server 2, the teacher data generation server 3, and the similarity determination server 4 that constitute the similarity determination system 1 according to the present embodiment.

[Configuration of query extraction server 2]
The query extraction server 2 is composed of one or more general-purpose computers. A general-purpose computer includes a central processing unit, a storage device, a communication device, an input device, a display device, and a bus line connecting them. The central processing unit is a CPU or the like and functions as the query extraction unit 22 described later. The storage device consists of memory (RAM, ROM), a hard disk (HDD), optical discs (CD, DVD, etc.), and the like, and functions as the search log DB 21 described later. The communication device consists of various wired and wireless LAN devices. The display device consists of various displays such as a liquid crystal display or a plasma display. The input device consists of a touch panel, or a keyboard and a pointing device (mouse, trackball, etc.). In such a general-purpose computer, the CPU controls the query extraction server 2 as a whole and realizes the various functions of the present invention in cooperation with the hardware described above by reading and executing programs as needed, such as the program for the teacher data generation process.

As shown in FIG. 2, the query extraction server 2 includes a search log DB 21 and a query extraction unit 22.

FIG. 3 is a diagram showing the search log DB 21 according to the present embodiment. Here, a search log is information recording that a user of a user terminal (not shown), having performed a search on a search engine via the user terminal, selected one URL on the search result page. The search log DB 21 stores a terminal IP address, a search date and time, a session ID, a rank, a search query, and a URL in association with one another as a search log.

The terminal IP address is the IP address of the user terminal that performed the search. The search date and time is the date and time at which the search was performed on the user terminal. The session ID identifies the session between the user terminal and the search engine at the time of the search. The rank is the page rank of the Web page corresponding to the URL selected from the search results on the user terminal. The search query is the one or more search queries that the search engine received from the user terminal in a single search. The URL is the URL selected by the user terminal on the search result page.

The query extraction unit 22 refers to the search logs stored in the search log DB 21 and, for each search performed using two search queries, extracts the combination of those two search queries. The query extraction unit 22 transmits each extracted combination of two search queries to the teacher data generation server 3.
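
The extraction step can be sketched roughly as follows. This is a minimal illustration, not the implementation described in the specification: it assumes that the queries of one search are stored as a single whitespace-separated string and that "a search performed using two search queries" means a record with exactly two distinct terms. The class `SearchLogRecord`, its field names, and `extract_query_pairs` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class SearchLogRecord:
    """One row of the search log DB 21 (field names are hypothetical)."""
    terminal_ip: str
    searched_at: str
    session_id: str
    rank: int
    search_query: str  # one or more queries received in a single search
    url: str


def extract_query_pairs(records: Iterable[SearchLogRecord]) -> List[Tuple[str, str]]:
    """Return distinct (first query, second query) pairs in first-seen order.

    Assumption: the queries of one search are separated by whitespace.
    """
    pairs: List[Tuple[str, str]] = []
    seen = set()
    for record in records:
        terms = record.search_query.split()
        # Keep only searches issued with exactly two different queries.
        if len(terms) == 2 and terms[0] != terms[1]:
            pair = (terms[0], terms[1])
            if pair not in seen:
                seen.add(pair)
                pairs.append(pair)
    return pairs


# Illustrative records; the values are made up.
sample_log = [
    SearchLogRecord("192.0.2.1", "2012-01-10T09:00:00", "s1", 1, "Spain Barcelona", "http://example.com/a"),
    SearchLogRecord("192.0.2.2", "2012-01-10T09:05:00", "s2", 3, "Spain", "http://example.com/b"),
    SearchLogRecord("192.0.2.3", "2012-01-10T09:07:00", "s3", 2, "Spain Barcelona", "http://example.com/c"),
]
query_pairs = extract_query_pairs(sample_log)
```

Deduplicating the pairs is a design choice of this sketch; the specification does not say whether repeated combinations are sent once or every time.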

[Configuration of teacher data generation server 3]
Like the query extraction server 2, the teacher data generation server 3 is composed of one or more general-purpose computers. Description of the parts of the teacher data generation server 3 that are the same as in the query extraction server 2 is omitted. The central processing unit of the teacher data generation server 3 functions as the query receiving unit 32, the relevance calculation unit 34, the feature information acquisition unit 35, and the storage control unit 36 described later. The storage device of the teacher data generation server 3 functions as the teacher data DB 31 and the category DB 33 described later.

As shown in FIG. 2, the teacher data generation server 3 includes a teacher data DB 31, a query receiving unit 32, a category DB 33, a relevance calculation unit 34, a feature information acquisition unit 35, and a storage control unit 36.

FIG. 4 is a diagram showing the teacher data DB 31 according to the present embodiment.
As shown in FIG. 4, the teacher data DB 31 stores teacher data. Each item of teacher data consists of a first query, a second query, a degree of association, and feature information.
The degree of association indicates how closely the first query and the second query are related. The feature information represents at least one of the respective properties of the first query and the second query and the relationship between the first query and the second query, and is composed of a plurality of elements.

The query receiving unit 32 acquires two queries by receiving a combination of two queries from the query extraction server 2. One of the two queries becomes the first query, and the other, which differs from the first query, becomes the second query. In other words, the query receiving unit 32 receives the first query and the second query from the query extraction server 2 automatically, without accepting any manual input.

The category DB 33 stores a category in association with information indicating the parent-child relationship between that category and other categories.
FIG. 5 is a diagram showing the category DB 33 according to the present embodiment. As shown in FIG. 5, the category DB 33 stores each category in association with the other categories directly connected to it.

A category is, for example, the name of a directory in a directory-type search service. In the category DB 33, a plurality of other categories are associated below a category. As a result, a given category may exist under more than one category.
In the present embodiment, the category DB 33 contains categories and other categories; however, a category code identifying each category and a category code identifying each other category may also be stored.
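
One way to picture such a store is as a mapping from each category to the set of categories directly above it. The sketch below is an assumption about representation only; the rows are illustrative (the contents of FIG. 5 are not reproduced in this text), and `CATEGORY_ROWS`, `build_parent_index`, and the "Travel" category are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# (upper category, lower category) rows. Illustrative data only.
CATEGORY_ROWS = [
    ("Region", "Country"),
    ("Travel", "Country"),  # hypothetical second parent of "Country"
    ("Country", "Spain"),
]


def build_parent_index(rows: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map every category to the set of categories directly above it."""
    parents: Dict[str, Set[str]] = {}
    for upper, lower in rows:
        # Top-level categories are registered with an empty parent set.
        parents.setdefault(upper, set())
        parents.setdefault(lower, set()).add(upper)
    return parents


parent_index = build_parent_index(CATEGORY_ROWS)
```

Because a category may sit under several categories, the structure is a directed graph rather than a strict tree, which is why each value is a set of parents and not a single parent.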

Based on the category DB 33, the relevance calculation unit 34 calculates the degree of association between a first category corresponding to the first query received by the query receiving unit 32 and a second category corresponding to the second query received by the query receiving unit 32.

Specifically, when the query receiving unit 32 receives the first query and the second query, the relevance calculation unit 34 refers to the category DB 33 and identifies the category corresponding to the first query and the category corresponding to the second query. In the present embodiment, the category corresponding to a query is the category whose character string matches the character string constituting the query. The category corresponding to the first query is called the first category, and the category corresponding to the second query is called the second category.

Next, based on the category DB 33, the relevance calculation unit 34 identifies the highest-level category above the first category. For example, the relevance calculation unit 34 repeats the process of extracting the upper category until no upper category is directly associated with the extracted category. When no upper category exists for an extracted category, the relevance calculation unit 34 identifies that category as the highest-level category. The relevance calculation unit 34 then identifies the path between the first category and the highest-level category. This path is called the first path.

For example, when the first query is "Spain" and the data shown in FIG. 5 is stored in the category DB 33, the relevance calculation unit 34 extracts "Region" as the highest-level category for the first query and identifies the first path as "Region/Country/Spain". Although this example shows a single first path, multiple first paths are identified, because a category above "Spain" (for example, "Country") exists under more than one category.
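
The upward walk described above can be sketched as a recursive enumeration of every path from a top-level category down to the category that matches the query. This is an illustrative sketch under assumptions: the parent table `PARENTS` is made-up data, the "Travel" category is hypothetical, the category graph is assumed to be acyclic, and `paths_to_top` is not a name from the specification.

```python
from typing import Dict, List, Set

# Category -> categories directly above it. Illustrative data only.
PARENTS: Dict[str, Set[str]] = {
    "Region": set(),
    "Travel": set(),  # hypothetical second top-level category
    "Country": {"Region", "Travel"},
    "Spain": {"Country"},
}


def paths_to_top(category: str, parents: Dict[str, Set[str]]) -> List[List[str]]:
    """Return every path from a top-level category down to `category`.

    An empty list means no category matches the given string.
    Assumes the parent-child relation contains no cycles.
    """
    if category not in parents:
        return []
    uppers = parents[category]
    if not uppers:
        # No upper category exists, so this is a highest-level category.
        return [[category]]
    paths: List[List[str]] = []
    for upper in sorted(uppers):  # sorted only to make the order deterministic
        for prefix in paths_to_top(upper, parents):
            paths.append(prefix + [category])
    return paths


# The query string is matched against category names by exact equality.
first_paths = paths_to_top("Spain", PARENTS)
```

With this data, "Spain" yields two first paths, one through "Region" and one through "Travel", which mirrors the remark that multiple first paths are identified.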

Next, the relevance calculation unit 34 identifies the highest-level category for the second category, using the same method as for the first category. The relevance calculation unit 34 then identifies the path between the second category and the highest-level category. This path is called the second path.

For example, when the second query is "Barcelona" and the data shown in FIG. 5 is stored in the category DB 33, the relevance calculation unit 34 extracts "Region" as the highest-level category for the second query and identifies the second path as "Region/Country/Spain/Local government/Catalonia/City/Barcelona". As with the first path, multiple second paths are identified in this example.

Next, the relevance calculation unit 34 calculates the degree of association between the first category and the second category by performing the following processing on the first path and the second path.
Specifically, when the highest-level categories of the first path and the second path match, the relevance calculation unit 34 calculates the length of the path that the two paths share starting from that highest-level category. The relevance calculation unit 34 then calculates the degree of association from this path length based on Equation (1) below.

Here, max{|D|, |D'|} is determined from the longest distance among the identified first paths and the longest distance among the identified second paths.
|P(D, D')| is the length of the shared path in the combination of a first path and a second path that shares the longest path starting from the highest-level category. For example, suppose the first path is "Region/Country/Spain" and the second path is "Region/Country/Spain/Local government/Catalonia/City/Barcelona"; the two paths share "Region/Country/Spain". If this combination of first path and second path is the one with the longest shared path, |P(D, D')| is 3.
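
The two quantities defined above can be computed as in the following sketch. Equation (1) itself does not survive in this text, so the final ratio |P(D, D')| / max{|D|, |D'|} is an assumption made purely for illustration, as is counting path length in categories (which matches the value 3 in the example). The function names are hypothetical.

```python
from typing import List, Sequence


def common_prefix_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading categories two paths share, counted from the top."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def shared_top_length(first_paths: List[List[str]], second_paths: List[List[str]]) -> int:
    """|P(D, D')|: the longest shared prefix over all path combinations."""
    return max(common_prefix_length(d1, d2) for d1 in first_paths for d2 in second_paths)


def prefix_relevance(first_paths: List[List[str]], second_paths: List[List[str]]) -> float:
    """ASSUMED form of Equation (1): |P(D, D')| / max{|D|, |D'|}."""
    longest_first = max(len(p) for p in first_paths)    # |D|
    longest_second = max(len(p) for p in second_paths)  # |D'|
    return shared_top_length(first_paths, second_paths) / max(longest_first, longest_second)


first = [["Region", "Country", "Spain"]]
second = [["Region", "Country", "Spain", "Local government", "Catalonia", "City", "Barcelona"]]
shared = shared_top_length(first, second)
score = prefix_relevance(first, second)
```

Under the assumed ratio, the Spain/Barcelona example gives 3/7; a different form of Equation (1) would change the score but not the two path-length computations.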

Instead of calculating the degree of association as described above, the relevance calculation unit 34 may identify a path common to the first path and the second path and calculate the degree of association based on that common path. In this case, the relevance calculation unit 34 calculates the degree of association from the length of the common path based on Equation (2) below.

Here, C(D, D') is the length of the common path in the combination of a first path and a second path that has the longest common path. Unlike in Equation (1), this common path need not start at the highest-level category; the first path and the second path may coincide only from somewhere partway along the path.
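
A common path that may start partway down can be read as the longest contiguous run of categories appearing in both paths. The sketch below computes only that length for one pair of paths, using the standard longest-common-substring dynamic program; this reading is an assumption, Equation (2) is not reproduced in this text and is not reconstructed here, and the second example path (with "Travel" and "Europe") is hypothetical.

```python
from typing import Sequence


def longest_common_subpath(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest contiguous run of categories shared by two paths."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the run that ended at the previous category of both paths.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


path_a = ["Region", "Country", "Spain", "Local government", "Catalonia", "City", "Barcelona"]
path_b = ["Travel", "Europe", "Spain", "Local government", "Catalonia", "City", "Barcelona"]
common_length = longest_common_subpath(path_a, path_b)
```

Here the two paths have different top-level categories, so a prefix-based measure would find nothing in common, while the shared run from "Spain" to "Barcelona" has length 5.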

The relevance calculation unit 34 may also calculate the degree of association from the values obtained with both Equation (1) and Equation (2). For example, the relevance calculation unit 34 weights the value calculated with Equation (1) and the value calculated with Equation (2) and then adds the weighted values to obtain the degree of association.
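
As a small worked example of the weighted combination, the sketch below adds the two values after weighting. The weights are not specified in the text; the defaults of 0.5 each and the function name `combined_relevance` are assumptions for illustration.

```python
def combined_relevance(eq1_value: float, eq2_value: float,
                       weight1: float = 0.5, weight2: float = 0.5) -> float:
    """Weighted sum of the Equation (1) and Equation (2) values.

    The weights are hypothetical; the specification does not give them.
    """
    return weight1 * eq1_value + weight2 * eq2_value


# 0.25 * 0.4 + 0.75 * 0.8 = 0.1 + 0.6 = 0.7
example = combined_relevance(0.4, 0.8, weight1=0.25, weight2=0.75)
```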

When the relevance calculation unit 34 cannot identify the category corresponding to the first query or the category corresponding to the second query, it treats the calculation of the degree of association between that first query and second query as having failed. In that case, the processing of the feature information acquisition unit 35 and the storage control unit 36, which would normally follow the calculation, is skipped. The teacher data generation server 3 therefore does not adopt as teacher data a first query and second query whose degree of association could not be calculated. This keeps information with an unknown degree of association out of the teacher data and improves its accuracy.

The feature information acquisition unit 35 acquires feature information representing the respective properties of the first query and the second query.
Specifically, for each of the first query and the second query, the feature information acquisition unit 35 acquires feature information including facet extraction characteristics, query text characteristics, result click characteristics, and query session co-occurrence characteristics from the query DB 42 described later.

The facet extraction characteristics are, for example, information on the first query and the second query in the search log DB 21, such as the probability that they were used together in one search, the probability that they were used at the head of the search keywords, the probability that they were used somewhere other than the head of the search keywords, the probability that they appeared in the same session, and the frequency with which they were clicked.

The query text characteristics are, for example, information on the first query and the second query such as their length in characters, the number of words they contain, and their Levenshtein distance measured in multibyte characters.
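
The query text characteristics named above can be computed directly from the two query strings, as in this sketch. Python strings iterate by Unicode code point, so a multibyte character such as a kana counts as one unit in both the length and the edit distance. Splitting words on whitespace and the feature names in the returned dictionary are assumptions; the specification does not define them.

```python
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Edit distance counted per character (code point), not per byte."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def query_text_features(first_query: str, second_query: str) -> Dict[str, int]:
    """Hypothetical feature names; words are assumed to be whitespace-separated."""
    return {
        "length_1": len(first_query),
        "length_2": len(second_query),
        "words_1": len(first_query.split()),
        "words_2": len(second_query.split()),
        "levenshtein": levenshtein(first_query, second_query),
    }


features = query_text_features("Spain", "Spain travel")
```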

The result click characteristics are, for example, information calculated based on the amount of information in the search results for each of the first query and the second query.

Although the feature information acquisition unit 35 has been described as acquiring the feature information from the query DB 42, this is not a limitation. For example, the feature information acquisition unit 35 may calculate the query text characteristics directly from the first query and the second query.

The storage control unit 36 associates the first query and the second query received by the query receiving unit 32, the relevance calculated by the relevance calculation unit 34, and the feature information acquired by the feature information acquisition unit 35 with one another as teacher data, and stores this teacher data in the teacher data DB 31.

[Configuration of similarity determination server 4]
Like the query extraction server 2 and the teacher data generation server 3, the similarity determination server 4 is configured from one or more general-purpose computers. Description of those parts of the similarity determination server 4 that are configured in the same way as the query extraction server 2 is omitted. The central processing unit of the similarity determination server 4 functions as the model generation unit 41 and the similarity determination unit 43 described later. The storage device of the similarity determination server 4 functions as the query DB 42 described later.

As shown in FIG. 2, the similarity determination server 4 includes a model generation unit 41, a query DB 42, and a similarity determination unit 43.

The model generation unit 41 refers to the teacher data DB 31, performs machine learning, and generates an identification model for the first query and the second query. Specifically, based on the teacher data stored in the teacher data DB 31, the model generation unit 41 performs machine learning on the relationship between the values of the first query, the second query, and the feature data on the one hand and the relevance (similarity) on the other, and generates the identification model. The model generation unit 41 generates the identification model based on all the data stored in the teacher data DB 31.

FIG. 6 is a diagram showing the query DB 42 according to the present embodiment. As shown in FIG. 6, the query DB 42 stores the first query, the second query, and the feature information in association with one another. That is, in the present embodiment, the first query and the second query stored in the query DB 42 have feature information calculated in advance, but their similarity is unknown.

The similarity determination unit 43 uses the identification model generated by the model generation unit 41 and the feature information stored in the query DB 42 to determine the relevance (similarity) of the first query and the second query stored in the query DB 42. The similarity determination unit 43 may also accept a first query and a second query from another server and determine the relevance of the accepted first query and second query.

[Operation]
Next, the operation of the similarity determination system 1 will be described with reference to FIGS. 7 and 8.
FIG. 7 is a flowchart showing the flow of the teacher data generation processing executed by the query extraction server 2 and the teacher data generation server 3.

[Operations of query extraction server 2 and teacher data generation server 3]
The query extraction unit 22 of the query extraction server 2 refers to the search log stored in the search log DB 21 and extracts two search queries (step S1).
Subsequently, the query extraction unit 22 of the query extraction server 2 transmits the two search queries extracted in step S1 to the teacher data generation server 3 (step S2).

Subsequently, the query receiving unit 32 of the teacher data generation server 3 receives the two queries from the query extraction server 2 (step S3).
Subsequently, based on the category DB 33, the relevance calculation unit 34 of the teacher data generation server 3 calculates the relevance between a first category corresponding to the first query received in step S3 and a second category corresponding to the second query received in step S3 (step S4).

Subsequently, the feature information acquisition unit 35 of the teacher data generation server 3 acquires, from the query DB 42, feature information representing the properties of each of the first query and the second query (step S5). Note that steps S4 and S5 may be performed in the reverse of the order described.

Subsequently, the storage control unit 36 of the teacher data generation server 3 associates the first query and the second query received in step S3, the relevance calculated in step S4, and the feature information acquired in step S5 with one another as teacher data, and stores this teacher data in the teacher data DB 31 (step S6).
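The flow of steps S3 to S6, together with the handling of a failed relevance calculation described for the relevance calculation unit 34, might be sketched as follows. This is a minimal sketch under stated assumptions: the function `generate_teacher_record`, its callable parameters, the record keys, and the use of an in-memory list in place of the teacher data DB 31 are all hypothetical and are not taken from the source.

```python
from typing import Any, Callable, Dict, List, Optional


def generate_teacher_record(
    q1: str,
    q2: str,
    category_of: Callable[[str], Optional[str]],    # hypothetical category lookup
    relevance_of: Callable[[str, str], float],      # hypothetical stand-in for unit 34
    features_of: Callable[[str, str], Dict[str, Any]],  # hypothetical stand-in for unit 35
    teacher_db: List[Dict[str, Any]],               # hypothetical stand-in for teacher data DB 31
) -> bool:
    """Store one teacher record for (q1, q2); return False if the pair is skipped."""
    c1, c2 = category_of(q1), category_of(q2)
    if c1 is None or c2 is None:
        # Relevance calculation failed: skip feature acquisition and storage.
        return False
    relevance = relevance_of(c1, c2)       # step S4
    features = features_of(q1, q2)         # step S5
    teacher_db.append({                    # step S6
        "query1": q1,
        "query2": q2,
        "relevance": relevance,
        "features": features,
    })
    return True
```

Returning a boolean is one way to let the caller count skipped pairs; the variation in which error information is passed to the later units could be modelled instead by raising or forwarding an error value.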

[Operation of similarity determination server 4]
FIG. 8 is a flowchart showing the flow of the similarity determination processing executed by the similarity determination server 4.
Based on the teacher data stored in the teacher data DB 31, the model generation unit 41 of the similarity determination server 4 performs machine learning on the relationship between the values of the first query, the second query, and the feature data on the one hand and the similarity on the other, and generates an identification model (step S11).

The similarity determination unit 43 of the similarity determination server 4 uses the identification model generated by the model generation unit 41 and the feature information stored in the query DB 42 to determine the similarity (relevance) of the first query and the second query stored in the query DB 42 (step S12).
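Steps S11 and S12 can be pictured with the following toy sketch. The source does not name a learning algorithm, so a one-nearest-neighbour lookup is swapped in here purely as a stand-in; the names `train` and `predict`, the `Record` type, and the numeric feature vectors are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical teacher record: (numeric feature vector, relevance label).
Record = Tuple[Sequence[float], float]


def train(teacher_data: List[Record]) -> List[Record]:
    """Step S11 stand-in: a nearest-neighbour 'model' just keeps the labelled examples."""
    if not teacher_data:
        raise ValueError("teacher data is empty")
    return list(teacher_data)


def predict(model: List[Record], features: Sequence[float]) -> float:
    """Step S12 stand-in: return the relevance of the closest training example."""
    nearest = min(model, key=lambda rec: math.dist(rec[0], features))
    return nearest[1]


model = train([([0.0, 0.0], 0.1), ([1.0, 1.0], 0.9)])
estimate = predict(model, [0.9, 0.8])
```

Any supervised learner that maps feature values to a relevance score could take the place of this lookup.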

As described above, in the similarity determination system 1 according to the present embodiment, the teacher data generation server 3 uses the relevance calculation unit 34 to calculate, based on the category DB 33, the relevance between a first category corresponding to the first query received by the query receiving unit 32 and a second category corresponding to the second query received by the query receiving unit 32. The teacher data generation server 3 then uses the feature information acquisition unit 35 to acquire feature information representing the properties of each of the first query and the second query, and uses the storage control unit 36 to associate the first query, the second query, the relevance, and the feature information with one another and store them in the teacher data DB 31 as teacher data.

In this way, the teacher data generation server 3 can generate teacher data automatically, without teacher data being created by hand, and can therefore generate teacher data efficiently.

In addition, the teacher data generation server 3 uses the relevance calculation unit 34 to identify, based on the category DB 33, the path between the first category and the category that is highest with respect to the first category, and the path between the second category and the category that is highest with respect to the second category, and calculates the relevance based on the identified paths. The teacher data generation server 3 can therefore calculate the relevance using the hierarchical structure of the categories stored in the category DB 33.

In addition, the teacher data generation server 3 uses the relevance calculation unit 34 to calculate the relevance based on the length of the portion of the identified paths that is common starting from the highest category. When the paths are common starting from the highest category, the first category and the second category corresponding to these paths are in a hierarchical relationship, and can therefore be said to be more similar than when the paths are not common from the highest category. The teacher data generation server 3 can therefore calculate the relevance between the first query and the second query accurately.

In addition, the teacher data generation server 3 uses the relevance calculation unit 34 to calculate the relevance based on the length of the portion that the identified paths have in common. When the paths are partially common, the first category and the second category corresponding to these paths can be said to exist together beneath some category. The first category and the second category can therefore be said to be more similar than when the paths have nothing in common. The teacher data generation server 3 can therefore calculate the relevance between the first query and the second query accurately.

In addition, the teacher data generation server 3 uses the relevance calculation unit 34 to calculate the relevance based on both the length of the portion of the identified paths that is common starting from the highest category and the length of the portion that the identified paths have in common. This allows the teacher data generation server 3 to calculate the relevance based on both the commonality of the paths from the highest category and the partial commonality of the paths.
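The two path-based measures described above can be sketched as follows. This is only an illustration: the table `CATEGORY_DB`, its contents, the function names, the normalization by the longer path, and the equal-weight average in `relevance` are all assumptions, and the scoring formula used by the embodiment is not reproduced here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical category table: id -> (name, parent id); None marks a highest category.
CATEGORY_DB: Dict[int, Tuple[str, Optional[int]]] = {
    1: ("Entertainment", None),
    2: ("Music", 1),
    3: ("Artists", 2),
    4: ("Shopping", None),
    5: ("Music", 4),
    6: ("Artists", 5),
    7: ("Instruments", 5),
}


def path_from_top(category_id: int) -> List[str]:
    """Follow parent links upward, then return the names from the highest category down."""
    names: List[str] = []
    node: Optional[int] = category_id
    while node is not None:
        name, node = CATEGORY_DB[node]
        names.append(name)
    return names[::-1]


def common_prefix_len(p1: List[str], p2: List[str]) -> int:
    """Length of the portion shared starting from the highest category."""
    n = 0
    for a, b in zip(p1, p2):
        if a != b:
            break
        n += 1
    return n


def longest_common_run(p1: List[str], p2: List[str]) -> int:
    """Length of the longest contiguous portion shared anywhere in the two paths."""
    best = 0
    prev = [0] * (len(p2) + 1)
    for a in p1:
        curr = [0]
        for j, b in enumerate(p2, start=1):
            curr.append(prev[j - 1] + 1 if a == b else 0)
            best = max(best, curr[j])
        prev = curr
    return best


def relevance(c1: int, c2: int) -> float:
    """Assumed scoring: average of both shared lengths, normalized by the longer path."""
    p1, p2 = path_from_top(c1), path_from_top(c2)
    longest = max(len(p1), len(p2))
    return (common_prefix_len(p1, p2) + longest_common_run(p1, p2)) / (2 * longest)
```

In this toy table, categories 6 and 7 share the portion "Shopping > Music" from the top, whereas categories 3 and 6 share only the partial portion "Music > Artists", so the first pair scores higher than the second.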

Although an embodiment of the present invention has been described above, the present invention is not limited to the embodiment described. The effects described for the embodiment merely list the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described for the embodiment.

For example, the present embodiment has been described taking as an example the case where the teacher data generation server 3 is realized as a single server. However, the functions of the teacher data generation server 3 may also be distributed across a plurality of servers, with these servers cooperating as a whole to realize the functions of the teacher data generation server 3.

In the present embodiment, the category DB 33 is stored in the teacher data generation server 3, but the category DB 33 may instead be stored in a server other than the teacher data generation server 3. In that case, the teacher data generation server 3 accesses the server in which the category DB 33 is stored and calculates the relevance based on the category DB 33 stored in that server.

In the present embodiment, the similarity determination system 1 is composed of three servers, namely the query extraction server 2, the teacher data generation server 3, and the similarity determination server 4, but this is not a limitation. For example, the similarity determination system 1 may realize the functions of these servers on a single server, or may distribute the functions across four or more servers.

In the present embodiment, the teacher data generation server 3 stores the generated teacher data in the teacher data DB 31, but the teacher data may instead be stored in a server other than the teacher data generation server 3. In that case, the similarity determination server 4 accesses the server in which the teacher data is stored and generates the identification model based on the teacher data stored in that server.

In the present embodiment, when the relevance calculation fails, the relevance calculation unit 34 skips the processing of the feature information acquisition unit 35 and the storage control unit 36 that would otherwise follow the relevance calculation, but this is not a limitation. For example, when the relevance calculation fails, the relevance calculation unit 34 may output error information indicating the failure to the feature information acquisition unit 35 and the storage control unit 36. The feature information acquisition unit 35 and the storage control unit 36 may then each refrain from performing the processing for their respective functions when they receive the error information.

1 Similarity determination system
2 Query extraction server
21 Search log DB
22 Query extraction unit
3 Teacher data generation server
31 Teacher data DB
32 Query receiving unit
33 Category DB
34 Relevance calculation unit
35 Feature information acquisition unit
36 Storage control unit
4 Similarity determination server
41 Model generation unit
42 Query DB
43 Similarity determination unit

Claims

A teacher data generation device that generates teacher data used for machine learning, comprising:
Query acquisition means for acquiring a first query and a second query different from the first query;
Relevance calculation means for calculating, based on category information storage means that stores a category in association with information indicating a parent-child relationship between the category and another category, the relevance between a first category corresponding to the first query acquired by the query acquisition means and a second category corresponding to the second query acquired by the query acquisition means;
Feature information acquisition means for acquiring, as feature information, at least one of information representing the properties of each of the first query and the second query and information representing the relevance of the first query and the second query; and
Storage control means for associating the first query, the second query, the relevance, and the feature information with one another and storing them in predetermined storage means as teacher data.

Based on the category information storage means, the relevance calculation means identifies a path between the first category and the category that is highest with respect to the first category, and a path between the second category and the category that is highest with respect to the second category, and calculates the relevance based on the identified paths.
The teacher data generation device according to claim 1.

The relevance calculation means calculates the relevance based on the length of the portion of the identified paths that is common starting from the highest category.
The teacher data generation device according to claim 2.

The relevance calculation means calculates the relevance based on the length of the portion that the identified paths have in common.
The teacher data generation device according to claim 2.

The relevance calculation means calculates the relevance based on the length of the portion of the identified paths that is common starting from the highest category and the length of the portion that the identified paths have in common.
The teacher data generation device according to claim 2.

A teacher data generation method in which a computer generates teacher data used for machine learning, the computer executing:
A query acquisition step of acquiring a first query and a second query different from the first query;
A relevance calculation step of calculating, based on category information storage means that stores a category in association with information indicating a parent-child relationship between the category and another category, the relevance between a first category corresponding to the first query acquired in the query acquisition step and a second category corresponding to the second query acquired in the query acquisition step;
A feature information acquisition step of acquiring, as feature information, at least one of information representing the properties of each of the first query and the second query and information representing a relationship between the first query and the second query; and
A storage control step of associating the first query, the second query, the relevance, and the feature information with one another and storing them in predetermined storage means as teacher data.

A teacher data generation program for causing a computer to generate teacher data used for machine learning, the program causing the computer to execute:
A query acquisition step of acquiring a first query and a second query different from the first query;
A relevance calculation step of calculating, based on category information storage means that stores a category in association with information indicating a parent-child relationship between the category and another category, the relevance between a first category corresponding to the first query acquired in the query acquisition step and a second category corresponding to the second query acquired in the query acquisition step;
A feature information acquisition step of acquiring, as feature information, at least one of information representing the properties of each of the first query and the second query and information representing a relationship between the first query and the second query; and
A storage control step of associating the first query, the second query, the relevance, and the feature information with one another and storing them in predetermined storage means as teacher data.