JP7477744B2

JP7477744B2 - Information processing device, control method, and program

Info

Publication number: JP7477744B2
Application number: JP2019198689A
Authority: JP
Inventors: 敬己下郡山
Original assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Current assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Priority date: 2019-10-31
Filing date: 2019-10-31
Publication date: 2024-05-02
Anticipated expiration: 2039-10-31
Also published as: JP2021071956A

Description

The present invention relates to a technique for creating learning data and generating learning models according to the characteristics of the data.

Conventionally, in order to present users with appropriate search results, there have been techniques that calculate, as statistical values, the relevance between a search condition and the terms (character strings extracted according to a fixed criterion such as morphological analysis or N-grams) contained in each document of a document set. These techniques are called similarity search and the like (hereinafter, in the description of the present invention, they are referred to uniformly as similarity search, and are distinguished from the search by ranking learning described later).

There is also a ranking learning technique that improves the accuracy of similarity search: the features observed when learning data and the documents to be searched are similar are modeled by machine learning, and when a new search condition is specified, the ranking is adjusted based on that learning model.

Ranking learning requires a large amount of learning data, but collecting it is difficult. Learning data could be collected from users' search logs after a similarity search system goes into operation, but evaluating search results places a burden on users, so a sufficient volume of logs cannot be guaranteed. Before operation begins, moreover, the learning data is limited to items such as those created by developers for testing.

Patent Document 1 concerns a technique that, given answers prepared in advance (in effect, a set of FAQ documents), finds the question most similar to a user's inquiry (a question sentence in the learning data) and returns the corresponding answer. For that technique, it provides a way to raise topic estimation accuracy even when there are few question sentences.

Specifically, the question sentences in the learning data are expanded by replacing words that appear in a question sentence with words in the corresponding answer; that is, the number of learning data items is increased. To exclude unnatural sentences from the expanded question sentences, a probabilistic language model is used to calculate the existence probability of each question sentence, and a sentence is used as learning data only when its existence probability exceeds a certain threshold.

JP 2017-37588 A

However, although the technique of Patent Document 1 uses a probabilistic language model to judge whether an expanded question sentence is appropriate, the substituted words are taken from the answers prepared in advance and may include technical terms or terms specific to a particular organization. In that case, the probabilistic language model may lack sufficient examples, and the question sentences may not be expanded appropriately.

Furthermore, the technique of Patent Document 1 aims to improve the learning effect by expanding the question sentences used as learning data. However, as the number of learning data items grows, the computation time required for learning becomes enormous, which can make the approach impractical.

An object of the present invention is to provide a technique for creating learning data based on data classified into groups containing similar data.

The present invention is an information processing device that manages learning models, characterized by comprising: a classification means that classifies data used for learning into groups; a determination means that determines, based on the data classified into a group, feature data representing the data classified into that group; a combination means that combines groups with one another, based on the feature data representing the data classified into the groups, to create a new group; and a registration means that registers a learning model generated using the data classified into the combined group together with the determined feature data of that group.

The present invention makes it possible to provide a technique for creating learning data based on data classified into groups containing similar data.

FIG. 1 is a diagram showing an example of a functional configuration according to an embodiment of the present invention.
FIG. 2 is a block diagram showing an example of a hardware configuration applicable to the information processing device 100 according to an embodiment of the present invention.
FIG. 3 is an example of a user interface for inputting search conditions and specifying a correct answer from the search results according to an embodiment of the present invention.
FIG. 4 is an example of search results of a similarity search according to an embodiment of the present invention.
FIG. 5 is an example of the result of clustering search results according to an embodiment of the present invention.
FIG. 6 is an example of the result of drilling down using clustering according to an embodiment of the present invention.
FIG. 7 is an example of the structure of learning data registered in the learning data storage unit according to an embodiment of the present invention.
FIG. 8 is an example of a flowchart explaining the learning model generation process according to an embodiment of the present invention.
FIG. 9 is an example of a diagram explaining group similarity and grouping of learning data according to an embodiment of the present invention.
FIG. 10 is an example of a diagram explaining the data structure used when storing a learning model and the similarity calculation for a cluster selected at search time according to an embodiment of the present invention.
FIG. 11 is an example of a flowchart explaining the search process according to an embodiment of the present invention.

In the present invention, machine learning is used to re-assign the ranking of the results of a conventional document search. This is called ranking learning. For convenience of explanation, the process of determining a learning model in advance is called "learning model generation," and the process of re-assigning the ranking of search results actually obtained from a search condition given by a user or the like, using the generated learning model, is called "re-ranking."

The present invention has three features. The first is how learning data is stored, for a document set that has no classification information, so that the learning models described later can be generated. The second is how, based on that learning data, the documents used as learning data are restricted to a subset when a learning model is generated. The third is how an appropriate learning model is selected from among multiple learning models when re-ranking a document set that was clustered dynamically at search time.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a diagram showing an example of a functional configuration according to an embodiment of the present invention. The search condition receiving unit 101 receives a search condition (a character string) from a search user or another program and sends it to the similarity search unit 102. The similarity search unit 102 searches the document storage unit 121 and obtains the search results that hit the search condition, that is, a document list. This search process can use any of various well-known techniques, such as calculating the similarity between the search condition and each document based on word frequency and the like and returning the top-ranked documents as the document list, so its description is omitted.
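
The similarity search itself is left to well-known techniques, so the following is only a minimal sketch of one such technique, assuming character bigrams as terms and cosine similarity over term counts. The function names, the sample documents, and the ID "0101" are illustrative assumptions and do not come from the patent.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Cut a string into overlapping character N-grams (the 'terms')."""
    text = text.replace(" ", "")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similarity_search(
    query: str, documents: dict[str, str], top_k: int = 10
) -> list[tuple[str, float]]:
    """Return (doc_id, score) pairs that hit the query, best first."""
    q = Counter(char_ngrams(query))
    scored = [
        (doc_id, cosine(q, Counter(char_ngrams(body))))
        for doc_id, body in documents.items()
    ]
    hits = [(doc_id, score) for doc_id, score in scored if score > 0.0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical document store standing in for document storage unit 121.
docs = {
    "0005": "Otani batting record in the major league",
    "0101": "weather forecast for tomorrow",
}
results = similarity_search("Otani batting record", docs)
```

Any other scoring scheme (for example TF-IDF with morphological analysis) could be substituted, since the rest of the processing only needs the resulting document list.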

The document list obtained as the search result is passed to the clustering unit 103, which divides it into clusters based on the similarity between documents, computed by natural language processing. Clustering is also a well-known technique, so its description is omitted. Some clustering methods always assign a document to exactly one cluster, while others allow a document to belong to multiple clusters; either may be used in the present invention.

The display unit 104 displays the clusters and, for example, has the user select one of them. It accepts the cluster selected by the user and displays the documents classified into that cluster (a part of the document list of the search results). At that time, the learning model selection unit 105 selects an appropriate learning model from the learning model storage unit 123 according to the cluster selected by the user, and the documents are re-ranked according to that learning model and displayed on the display unit 104.

The display unit 104 also has the user select, for example, one document from the documents included in the selected cluster and passes that document to the learning data registration unit 107 as the correct answer for the search condition. The learning data registration unit 107 then composes learning data and registers it in the learning data storage unit 122.

The learning model generation unit 108 generates learning models for re-ranking using the learning data stored in the learning data storage unit 122. Rather than generating a single learning model from all the learning data, the generation model determination unit 109 groups the learning data using information about the documents included in the cluster at the time each learning data item was registered, and the learning model generation unit 108 generates a learning model for each of those groups.

FIG. 2 is a block diagram showing an example of a hardware configuration applicable to the information processing device 100 according to an embodiment of the present invention.

As shown in FIG. 2, the information processing device 100 has a configuration in which a CPU (Central Processing Unit) 201, a RAM (Random Access Memory) 202, a ROM (Read Only Memory) 203, an input controller 205, a video controller 206, a memory controller 207, a communication I/F controller 208, and the like are connected via a system bus 204.

The CPU 201 provides overall control of the devices and controllers connected to the system bus 204.

The ROM 203 or the external memory 211 stores the BIOS (Basic Input/Output System) and OS (Operating System), which are control programs for the CPU 201, as well as the various programs described later that are needed to realize the functions executed by each server or PC. Information needed to implement the present invention is also stored there. The external memory may be a database.

The RAM 202 functions as the main memory, work area, and so on of the CPU 201. When executing processing, the CPU 201 loads the necessary programs and the like from the ROM 203 or the external memory 211 into the RAM 202 and executes the loaded programs to realize various operations.

The input controller 205 controls input from a keyboard (KB) 209 and from a pointing device such as a mouse (not shown).

The video controller 206 controls display on a display device such as the display 210. The display device may be, for example, a liquid crystal display. These are used by an administrator as necessary.

The memory controller 207 controls access to the external memory 211, such as an external storage device (hard disk (HD)) or flexible disk (FD) that stores a boot program, various applications, font data, user files, edited files, various data, and so on, or a CompactFlash (registered trademark) memory connected via an adapter to a PCMCIA (Personal Computer Memory Card International Association) card slot.

The communication I/F controller 208 connects to and communicates with external devices via a network and executes communication control processing on the network. For example, communication using TCP/IP (Transmission Control Protocol/Internet Protocol) is possible.

The CPU 201 enables display on the display 210 by, for example, expanding (rasterizing) outline fonts into a display information area in the RAM 202. The CPU 201 also enables user instructions via a mouse cursor (not shown) or the like on the display 210.

The various programs described later for realizing the present invention are recorded in the external memory 211 and are executed by the CPU 201 after being loaded into the RAM 202 as needed.

Next, an overview of the search is given with reference to FIGS. 3 to 6. FIG. 3 is an example of a user interface for inputting search conditions and specifying a correct answer from the search results according to an embodiment of the present invention.

The condition search input screen 301 is a screen on which a search user starts a search by entering, for example, "Otani's batting record" in the question sentence field 311 and pressing the search button 312.

The document data that the search user ultimately wants to view is described with reference to the document viewing screen 302. The display area 321 shows that the document ID is "0028", together with the title, the body text, and so on.

If the search user views the document displayed in the display area 321 and judges that it is exactly the information he or she was looking for, pressing the correct answer button 322 registers it in the learning data storage unit 122.

However, before document ID "0028" is reached, the search result list 400 in FIG. 4 is presented first. In general, such a search result list 400 is often presented to the search user as is, but some commercial systems classify the results automatically and limit the set of documents presented. Although it depends on the number of registered documents, when several hundred documents hit a search condition it is hard for the search user to reach the desired information; narrowing the results to a few dozen by some means, for example by having the search user specify a further condition, makes the desired information easier to reach.

One such method is to cluster the documents that hit the search condition and present documents with similar content as groups called clusters. By first selecting the cluster that holds the information they want, search users restrict the documents they need to check to a subset, as described above.

Even within the search result list in FIG. 4, information about, say, the player Ichiro Otani is expected to divide into clusters such as his high school years, his period in Japanese professional baseball, and his results in the major leagues. In FIG. 5, these classifications are presented to the search user as the example cluster list 501.

When the search user selects cluster 503, "Major League", the list of documents classified into that cluster (600 in FIG. 6) is displayed. Whereas the search result list 400 in FIG. 4 contained a variety of information about the player Ichiro Otani, the document list 600 in FIG. 6 shows an example that contains only articles about his performance in the major leagues. Restricting the document list in this way lets the search user easily reach the desired information (document ID "0028" in FIG. 3).

Naturally, clustering produces different groups when a different set of documents hits the question sentence specified by the search user (311 in FIG. 3). A search with the question sentence "baseball in America", as in 502 in FIG. 5, does not necessarily gather information about the player Ichiro Otani, and documents different from those in the example of FIG. 4 are hit. Clustering those documents then yields different clusters. In the example of 502, a cluster called "Major League" (504) is generated, but although its heading is the same, it contains many documents that are entirely different from those in 503. It might contain, for example, the position of the major leagues within American baseball, or the records of famous players other than Ichiro Otani. Thus, even when clusters share the title "Major League", the title itself carries little meaning; what matters as an expression of the linguistic characteristics of a cluster is the list of document IDs it contains.
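
The point that a shared cluster title says little can be illustrated with a small sketch that compares two clusters by their document ID sets. The Jaccard overlap used here is an illustrative choice, not a measure named in the patent, and the document IDs listed for cluster 504 are hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Share of document IDs that two clusters have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Cluster 503 ("Otani's batting record" search); abbreviated ID list.
cluster_503 = {"label": "Major League", "doc_ids": {"0005", "0006", "0028"}}
# Cluster 504 ("baseball in America" search); these IDs are hypothetical.
cluster_504 = {
    "label": "Major League",
    "doc_ids": {"0028", "0140", "0141", "0142"},
}

same_label = cluster_503["label"] == cluster_504["label"]
overlap = jaccard(cluster_503["doc_ids"], cluster_504["doc_ids"])
```

Here the labels match while the overlap of document IDs is small, which is why the document ID list, rather than the title, is treated as the characteristic of a cluster.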

FIG. 7 is an example of the structure of learning data according to an embodiment of the present invention. In ranking learning, the learning data used to generate a learning model is obtained, for example, when a user who has actually run a search designates a document that matches his or her search intent. Accordingly, at least the user's search condition and information identifying the document the user selected must be registered as a pair. In the learning data 701, these are stored as the question sentence 703 and the correct answer document ID 704.

The learning data 701 (a to f) stored in the learning data storage unit 122 of the present invention is characterized in that, in addition to the question sentence 703 and the correct answer document ID 704, it stores a same-cluster document ID list 705.

For example, FIG. 6 shows that a search user who entered the search condition "Otani's batting record" selected cluster 503, "Major League". Because that cluster contains the document IDs "0005", "0006", ..., "0028", the same-cluster document ID list 705 of learning data 701a holds this list of document IDs as is.
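
A minimal sketch of such a learning data record is shown below, assuming a simple in-memory list in place of the learning data storage unit 122. The class and function names are illustrative, the learning data ID "L0001" is assumed to correspond to 701a, and the document ID list is abbreviated to three of the IDs named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearningRecord:
    learning_id: str                       # e.g. "L0001" (assumed naming)
    question: str                          # question sentence 703
    correct_doc_id: str                    # correct answer document ID 704
    same_cluster_doc_ids: tuple[str, ...]  # same-cluster document ID list 705


def register_correct_answer(
    store: list[LearningRecord],
    learning_id: str,
    question: str,
    correct_doc_id: str,
    cluster_doc_ids: list[str],
) -> LearningRecord:
    """Compose a record when the user marks a document as the correct answer."""
    if correct_doc_id not in cluster_doc_ids:
        raise ValueError("the correct document must belong to the selected cluster")
    record = LearningRecord(
        learning_id, question, correct_doc_id, tuple(cluster_doc_ids)
    )
    store.append(record)
    return record


store: list[LearningRecord] = []
register_correct_answer(
    store, "L0001", "Otani's batting record", "0028", ["0005", "0006", "0028"]
)
```

Storing the whole cluster membership alongside the question and the correct answer is what later allows learning data to be grouped without any classification information in the documents.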

Next, the learning model generation process is described with reference to FIGS. 8 and 9. FIG. 8 is an example of a flowchart explaining the learning model generation process according to an embodiment of the present invention. Each step of the flowchart in FIG. 8 is executed by the CPU 201 of the information processing device 100.

<Embodiment 1>

In step S801, all the learning data stored in the learning data storage unit 122 is read. The data read is shown as the learning data document ID vectors (example 1) 901 in FIG. 9. This table is essentially the same as FIG. 7, but whereas FIG. 7 expresses the document IDs as a list, here they are expressed as a vector with one column per document ID (assuming, for example, that document IDs run from 1 to 500). Each element of the vector is "1" if that document ID is included in the same-cluster document ID list 705 and "0" if it is not.
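
The conversion from a same-cluster document ID list to a document ID vector can be sketched as follows, assuming as in the example above that document IDs are numeric strings from 1 to 500. The function name and the abbreviated ID list are illustrative.

```python
def to_doc_id_vector(same_cluster_doc_ids: list[str], num_docs: int = 500) -> list[int]:
    """Return a 0/1 vector with one element per document ID (1-based IDs)."""
    vector = [0] * num_docs
    for doc_id in same_cluster_doc_ids:
        index = int(doc_id) - 1  # document ID "0001" maps to element 0
        if not 0 <= index < num_docs:
            raise ValueError(f"document ID out of range: {doc_id}")
        vector[index] = 1
    return vector


vector = to_doc_id_vector(["0005", "0006", "0028"])
```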

In step S802, the learning data items in 901 that have the same correct answer document ID (911) are merged, as in the learning data document ID vectors (example 2) 902 in FIG. 9. For example, two learning data items, "L0001" and "L0003", have the correct answer document ID "0028". These are merged into the first row of 902. Specifically, the two learning data IDs are listed in the learning data ID list 916, and whereas the document ID vector in 901 indicated only presence (1) or absence (0) for each ID, the merged vector indicates how many times each ID occurred in total. For example, the value for document ID "0005" is "1" in 914 but "2" in 917. Needless to say, this simple summation is only an example, and various other calculation methods are possible.
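
The merge in step S802 can be sketched as below, using simple element-wise summation as in the example. To keep the example readable, the vectors have 6 elements instead of 500, and "L0002" with the correct answer document ID "0042" is a hypothetical third record.

```python
def merge_by_correct_doc(
    records: list[tuple[str, str, list[int]]],
) -> dict[str, tuple[list[str], list[int]]]:
    """Merge (learning_id, correct_doc_id, vector) rows per correct document.

    Returns correct_doc_id -> (learning data ID list, summed vector).
    """
    merged: dict[str, tuple[list[str], list[int]]] = {}
    for learning_id, correct_doc_id, vector in records:
        if correct_doc_id not in merged:
            merged[correct_doc_id] = ([], [0] * len(vector))
        ids, total = merged[correct_doc_id]
        ids.append(learning_id)
        for i, value in enumerate(vector):
            total[i] += value  # count how often each document co-occurred
    return merged


rows = [
    ("L0001", "0028", [1, 1, 1, 0, 0, 0]),
    ("L0002", "0042", [0, 0, 0, 1, 1, 1]),  # hypothetical record
    ("L0003", "0028", [1, 1, 0, 0, 0, 0]),
]
merged = merge_by_correct_doc(rows)
```

In this toy data the first element plays the role of document ID "0005": it is 1 in each source row and 2 in the merged row for "0028".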

In step S803, the learning data in 902 is grouped. The purpose is to decide which learning data is used when generating each learning model. If documents contained some kind of classification information, the learning data could be grouped by collecting the items whose correct answer documents share the same classification information. The present invention, however, assumes that such classification information does not exist or cannot be used, and therefore assumes that learning data items whose drilled-down clusters (as in FIG. 6) contain similar sets of documents should be used together to generate one learning model.

A well-known technique for grouping the learning data here is vector clustering. In example 2 of FIG. 9, the 500-dimensional vectors, whose dimension corresponds to the number of documents, are compared with one another and clustered. There are also overlapping clustering techniques that allow the same vector to belong to multiple clusters. Needless to say, any method may be used as long as it divides these vectors into clusters.
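
Since any vector clustering method may be used, the following is only one possible sketch: a single-pass "leader" clustering that assigns each vector to the most similar existing group when the cosine similarity reaches a threshold, and otherwise starts a new group. The threshold value, the 6-element vectors, and the correct answer document IDs "0031" and "0042" are assumptions for illustration.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def leader_clustering(
    vectors: dict[str, list[float]], threshold: float = 0.5
) -> list[list[str]]:
    """Group keys whose vectors are similar; returns a list of member lists."""
    groups: list[dict] = []
    for key, vec in vectors.items():
        best, best_sim = None, threshold
        for group in groups:
            sim = cosine(group["sum"], vec)
            if sim >= best_sim:
                best, best_sim = group, sim
        if best is None:
            groups.append({"sum": list(vec), "members": [key]})
        else:
            best["members"].append(key)
            # Keep a running sum as the representative vector of the group.
            best["sum"] = [a + b for a, b in zip(best["sum"], vec)]
    return [group["members"] for group in groups]


merged_vectors = {
    "0028": [2, 2, 1, 0, 0, 0],
    "0031": [1, 1, 1, 0, 0, 0],  # hypothetical
    "0042": [0, 0, 0, 1, 1, 1],  # hypothetical
}
groups = leader_clustering(merged_vectors)
```

A hierarchical or overlapping clustering method could replace this function without changing the surrounding steps, since only the resulting groups are used afterwards.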

As a result, even when the document set has no classification information in advance, as described above, the learning data can be divided into appropriate groups to produce multiple sets of learning data. Examples of the grouped learning data are shown as groups 903 (a, b) in FIG. 9.

The loop from S804 to S806 processes each of the learning data groups formed in S803, one at a time.

In S805, each learning data group (for example, 903a and 903b in FIG. 9) is taken in turn, and a learning model is generated from that learning data. For ranking learning, this can be realized with an SVM (support vector machine) or the like. The generated learning model is stored in the learning model storage unit 123.
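
The per-group loop of S804 to S806 can be sketched as follows. Note that this sketch swaps in a simple pairwise perceptron as a stand-in for the SVM mentioned above, purely to stay self-contained; the two-element feature vectors for each (correct document, other document) pair are hypothetical and the patent does not specify the features.

```python
Pair = tuple[list[float], list[float]]  # (correct doc features, other doc features)


def train_pairwise_ranker(
    pairs: list[Pair], epochs: int = 20, lr: float = 0.1
) -> list[float]:
    """Learn weights w so that w . correct > w . other for each pair."""
    weights = [0.0] * len(pairs[0][0])
    for _ in range(epochs):
        for correct, other in pairs:
            diff = [c - o for c, o in zip(correct, other)]
            margin = sum(w * d for w, d in zip(weights, diff))
            if margin <= 0:  # pair is ranked wrongly, so move toward diff
                weights = [w + lr * d for w, d in zip(weights, diff)]
    return weights


def train_models_per_group(groups: dict[str, list[Pair]]) -> dict[str, list[float]]:
    """Loop S804 to S806: one learning model per learning data group."""
    return {group_id: train_pairwise_ranker(pairs) for group_id, pairs in groups.items()}


training_groups = {
    "903a": [([1.0, 0.2], [0.3, 0.9]), ([0.8, 0.1], [0.2, 0.7])],
    "903b": [([0.1, 0.9], [0.9, 0.2])],
}
models = train_models_per_group(training_groups)
```

Because each group is trained independently, the two groups in this toy data end up with weights of opposite sign, which is the kind of group-specific behavior that a single model trained on all the data could not express.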

This completes the description, based on the flowchart in FIG. 8, of learning model generation in the present invention.

FIG. 10 is an example of a diagram explaining the data structure used when storing a learning model and the similarity calculation for a cluster selected at search time according to an embodiment of the present invention. First, the learning models stored in the learning model storage unit 123 are described. In this figure, two learning models are assumed to be stored.

The body of each learning model generated by an SVM (support vector machine) or the like is stored in the learning model storage unit 123. In the conventional art, however, a learning model carries no information about the conditions under which it should be used, and the application (or user) that uses the learning models has to choose the one to use from among them. Under the premise of the present invention, no fixed information equivalent to classification information is prepared for the document set, and the clusters in FIG. 5 are also generated dynamically, so which learning model to use can only be decided at search time.

A feature of the present invention is that each learning model is associated with linguistic features 1003 of the learning model, which indicate the circumstances in which that learning model should be used.

As an example of the setting data 1005 for the linguistic features 1003, the case of a document ID vector (sum) is shown. The document ID vector (sum) 1006 is simply the sum of the document ID vectors of the learning data included in group 903a in FIG. 9. In other words, it records the tendency, in the learning data used to generate this learning model, of which documents appeared in the same cluster as the correct answer document and how often. In similarity search and document clustering, this constitutes a kind of linguistic feature, and it serves as the linguistic feature of the learning model in the present invention.
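
A minimal sketch of registering a learning model together with its linguistic feature is given below. The class name, the placeholder weights standing in for the model body, and the 6-element vectors are illustrative assumptions; only the idea of storing the summed document ID vector next to the model comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class RegisteredModel:
    model_id: str                 # e.g. "1001a"
    weights: list[float]          # placeholder for the learning model body
    doc_id_vector_sum: list[int]  # linguistic feature: document ID vector (sum)


def sum_vectors(vectors: list[list[int]]) -> list[int]:
    """Element-wise sum of the document ID vectors in one group."""
    return [sum(column) for column in zip(*vectors)]


group_903a_vectors = [[2, 2, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0]]
entry = RegisteredModel(
    model_id="1001a",
    weights=[0.07, -0.07],  # hypothetical values
    doc_id_vector_sum=sum_vectors(group_903a_vectors),
)
```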

図１１は、本発明の実施形態に係る検索処理を説明するフローチャートの一例である。図１１のフローチャートの各ステップは、情報処理装置１００上のＣＰＵ２０１で実行される。 FIG. 11 is an example of a flowchart illustrating a search process according to an embodiment of the present invention. Each step of the flowchart in FIG. 11 is executed by the CPU 201 on the information processing device 100.

ステップＳ１１０１では、検索ユーザあるいはアプリケーションから検索条件を受け付け、ステップＳ１１０２では、文書記憶部１２１から当該検索条件にヒットする文書群を取得する。 In step S1101, search conditions are accepted from a search user or an application, and in step S1102, a group of documents that match the search conditions are obtained from the document storage unit 121.

ステップＳ１１０３では、ステップＳ１１０２で取得した文書群をクラスタリングする。この例が図５の５０１で、例えば３つのクラスタとなっている。 In step S1103, the documents acquired in step S1102 are clustered. An example of this is shown in FIG. 5, with three clusters.

ステップＳ１１０４では、前記クラスタの一覧をユーザに提示し、ステップＳ１１０５では、提示されたクラスタの中からユーザが１つを選択する。すなわちドリルダウンする。検索ユーザは図５の５０１の中から"メジャーリーグ”というクラスタを選択したとする。 In step S1104, the list of clusters is presented to the user, and in step S1105, the user selects one of the presented clusters, i.e., drills down. Let us assume that the search user selects the cluster "Major League Baseball" from 501 in FIG. 5.

ステップＳ１１０６では、前記ユーザが選択したクラスタ（例では”メジャーリーグ”）に含まれる文書ＩＤリストをベクトルとして生成する。図１０の１００７（検索時に自動生成されたクラスタに含まれる文書ＩＤベクトル）が生成されたものである。次に１００１ａ～１００１ｂの文書ＩＤベクトル（総和）１００６の中で、１００７のベクトルと類似度が一番高いものを特定する。 In step S1106, a list of document IDs contained in the cluster selected by the user ("Major League Baseball" in this example) is generated as a vector. This is the document ID vector 1007 in FIG. 10 (document ID vector contained in the cluster automatically generated during search). Next, of the document ID vectors (sum) 1006 of 1001a to 1001b, the one with the highest similarity to vector 1007 is identified.
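The selection in step S1106 can be sketched as follows. The specification does not state which similarity measure is used, so cosine similarity is an assumption here. The function names and the sample vectors are likewise hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_model(
    cluster_vector: List[float],
    model_features: Dict[str, List[float]],
) -> Tuple[Optional[str], float]:
    """Return the ID and similarity of the model whose summed vector is closest."""
    best_id: Optional[str] = None
    best_score = -1.0
    for model_id, feature in model_features.items():
        score = cosine_similarity(cluster_vector, feature)
        if score > best_score:
            best_id, best_score = model_id, score
    return best_id, best_score


# Hypothetical vector 1007 for the cluster the user selected.
vector_1007 = [1, 0, 1, 0, 1]
# Hypothetical summed vectors 1006 for the learning models 1001a and 1001b.
model_features = {
    "1001a": [3, 1, 2, 0, 1],
    "1001b": [0, 2, 0, 3, 0],
}

best_id, score = most_similar_model(vector_1007, model_features)
print(best_id, round(score, 3))  # 1001a 0.894
```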

しかしながら、一番類似度が高いものでも、学習モデルとして採用するのが不適切な場合がある。そこで、ステップＳ１１０７では、不図示の記憶部に記憶された閾値と比較し、ベクトルの類似度が閾値を超える学習モデルがない場合（ＮＯの場合）は学習モデルを用いずに、ステップＳ１１０２の類似検索結果のランキングをそのままユーザに提示するようにしても良い。適切な学習モデルがある場合には、ステップＳ１１０８において、ステップＳ１１０２の類似検索の結果を再ランク付けし、ステップＳ１１０９で検索結果としてユーザに提示する。 However, even the learning model with the highest similarity may be inappropriate to adopt. Therefore, in step S1107, the similarity is compared with a threshold stored in a storage unit (not shown); if no learning model has a vector similarity exceeding the threshold (NO), the ranking of the similarity search results from step S1102 may be presented to the user as is, without using a learning model. If an appropriate learning model exists, in step S1108 the results of the similarity search from step S1102 are re-ranked using that learning model, and in step S1109 they are presented to the user as the search results.
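The branch in steps S1107 to S1109 can be sketched as below. The learning model is represented by a plain scoring callable, which is a stand-in assumption for a ranking model such as one produced by an SVM. The function name `present_results`, the threshold value, and the sample scores are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def present_results(
    similarity_ranking: List[str],
    best_similarity: float,
    threshold: float,
    score_with_model: Optional[Callable[[str], float]],
) -> List[str]:
    """Re-rank with the model only when its similarity exceeds the threshold."""
    if score_with_model is None or best_similarity <= threshold:
        # No suitable model: keep the similarity search ranking unchanged.
        return list(similarity_ranking)
    # sorted() is stable, so documents with equal scores keep their original order.
    return sorted(similarity_ranking, key=score_with_model, reverse=True)


# Hypothetical similarity search ranking and hypothetical model scores.
ranking = ["doc3", "doc1", "doc2"]
model_scores: Dict[str, float] = {"doc1": 0.9, "doc2": 0.2, "doc3": 0.5}

print(present_results(ranking, 0.89, 0.5, model_scores.get))  # ['doc1', 'doc3', 'doc2']
print(present_results(ranking, 0.30, 0.5, model_scores.get))  # ['doc3', 'doc1', 'doc2']
```

The comparison is strict because the text speaks of a similarity that exceeds the threshold; whether equality should count is not stated in the specification.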

最後にステップＳ１１１０では、ユーザが提示された検索結果の中から１つの文書を、ユーザ自身の検索に対して適切な文書であった、と指定した場合にはそれを正解選択として受付け、ステップＳ１１１１にて新たな学習データとして学習データ記憶部１２２に登録する。この学習データは次回の学習モデル生成時に使われることになる。 Finally, in step S1110, if the user designates one document from the presented search results as being appropriate for the user's own search, that document is accepted as the correct selection, and in step S1111, it is registered as new learning data in the learning data storage unit 122. This learning data will be used the next time a learning model is generated.
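Steps S1110 and S1111 can be sketched as a small in-memory store. The record layout (query text, correct answer document ID, and the document IDs of the selected cluster) is an assumption made for illustration; the specification only states that the selection is registered as new learning data. The names `LearningRecord` and `LearningDataStore` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class LearningRecord:
    """One learning-data entry (field layout is an assumption)."""
    query: str
    correct_document_id: int
    cluster_document_ids: Tuple[int, ...]


class LearningDataStore:
    """In-memory stand-in for the learning data storage unit 122."""

    def __init__(self) -> None:
        self._records: List[LearningRecord] = []

    def register(
        self,
        query: str,
        presented_ids: Iterable[int],
        selected_id: Optional[int],
        cluster_ids: Iterable[int],
    ) -> Optional[LearningRecord]:
        """Store a record only if the user picked one of the presented documents."""
        if selected_id is None or selected_id not in set(presented_ids):
            return None
        record = LearningRecord(query, selected_id, tuple(sorted(cluster_ids)))
        self._records.append(record)
        return record

    def records(self) -> List[LearningRecord]:
        """Return a copy of the stored records for the next model generation."""
        return list(self._records)


store = LearningDataStore()
store.register("major league trade news", [12, 7, 30], 7, [30, 7, 12, 44])
store.register("major league trade news", [12, 7, 30], None, [30, 7, 12, 44])
print(len(store.records()))  # 1
```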

以上で、図１１のフローチャートを用いて、クラスタリング及びドリルダウンから、適切な学習モデルを選択して、クラスタ内に出現した文書群を再ランク付けしてユーザに提示する処理についての説明を完了する。 This completes the explanation, using the flowchart in FIG. 11, of the process of selecting an appropriate learning model based on clustering and drill-down, re-ranking the documents that appear in the cluster, and presenting them to the user.

＜実施形態２＞ <Embodiment 2>

他の実施形態について説明する。図８のフローチャートにおいては、ステップＳ８０２で同一の正解文書ＩＤをもつ学習データを１つにまとめたが、この処理を実施しなくても良い。 Another embodiment will be described below. In the flowchart of FIG. 8, learning data having the same correct answer document ID are grouped together in step S802, but this process does not have to be performed.

その場合には、同じ正解文書ＩＤを持つ学習データが異なる学習データのグループに含まれるようになる。正解となる文書が同一のものであっても、そもそものユーザの検索意図が異なれば、検索条件の言語的特徴も異なり、同一クラスタに含まれる他の文書ＩＤも全く異なる可能性もある。このような学習データを無理に１つにまとめて同一の学習データを生成するために用いる必要はなく、異なる学習モデルを生成するために用いることで、よりユーザの意図を反映した学習モデルが生成可能になるという効果を得ることができる。 In that case, learning data having the same correct answer document ID will be included in different learning data groups. Even if the correct answer document is the same, if the user's original search intent differs, the linguistic features of the search conditions will also differ, and the other document IDs included in the same cluster may be completely different. There is no need to force such learning data together and use it to generate a single learning model; using it to generate different learning models makes it possible to generate learning models that better reflect the user's intent.

＜実施形態３＞ <Embodiment 3>

図１０の例では学習モデルの言語的特徴１００３の設定データ１００５を文書ＩＤベクトル（総和）１００６としているが、クラスタの言語的特徴を表すものであれば、いかなるものでもよいのはいうまでもない。 In the example of FIG. 10, the setting data 1005 of the linguistic feature 1003 of the learning model is the document ID vector (sum) 1006, but it goes without saying that anything that represents the linguistic features of the cluster may be used.

例えば、対応する学習モデルを生成するために使用した学習モデルの”質問文”と、正解となる文書内のテキストから、特徴語（重要語など）を自然言語処理により取り出して、１００５に格納しても良い。 For example, feature words (such as important words) may be extracted by natural language processing from the "question sentence" of the learning data used to generate the corresponding learning model and from the text of the correct answer document, and stored in the setting data 1005.

この場合、検索時にも検索ユーザが選択したクラスタに含まれる文書から特徴語を抽出して、１００５と比較しても良い。 In this case, at search time as well, feature words may be extracted from the documents contained in the cluster selected by the search user and compared with the setting data 1005.
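The feature-word comparison can be sketched as a set overlap. The extraction of feature words by natural language processing is outside this sketch, so the word sets are given directly. The use of the Jaccard coefficient is an assumption, since the specification does not name a comparison method, and all sample words are hypothetical.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient: shared words divided by all distinct words."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical feature words stored as setting data 1005 for one learning model.
model_feature_words = {"pitcher", "home run", "trade"}
# Hypothetical feature words extracted from the cluster the user selected.
cluster_feature_words = {"pitcher", "home run", "draft"}

print(jaccard(model_feature_words, cluster_feature_words))  # 0.5
```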

また、単語そのものではなくても良い。周知の技術のモデルがある。”ＴｏｍａｓＭｉｋｏｌｏｖ，ＫａｉＣｈｅｎ，ａｎｄＪｅｆｆｒｅｙＤｅａｎ，Ｅｆｆｉｃｉｅｎｔｅｓｔｉｍａｔｉｏｎｏｆｗｏｒｄｒｅｐｒｅｓｅｎｔａｔｉｏｎｉｎｖｅｃｔｏｒｓｐａｃｅ，ＣｏＲＲ，Ｖｏｌ．ａｂｓ／１３０１．３７８１，，２０１３” Moreover, the feature need not be the words themselves. A model from a well-known technique can be used: "Tomas Mikolov, Kai Chen, and Jeffrey Dean, Efficient estimation of word representations in vector space, CoRR, Vol. abs/1301.3781, 2013"

この技術では、大量の文書内に出現する単語を例えば２００次元の素性ベクトルとして表すように学習する。さらに文書の内容は、それら素性ベクトルの和として考えることができる。従って、学習モデルの生成に関与した質問文やクラスタに含まれた文書群の特徴を素性ベクトルとして表し、また検索時には、ユーザが選択したクラスタに含まれる文書群から素性ベクトルを生成して、類似度を比較することも可能である。 This technique learns to represent words that appear in a large number of documents as feature vectors of, for example, 200 dimensions. Furthermore, the content of a document can be regarded as the sum of these feature vectors. Therefore, the characteristics of the question sentences involved in generating the learning model and of the documents contained in the clusters can be represented as feature vectors, and at search time a feature vector can be generated from the documents contained in the cluster selected by the user and the similarities compared.
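The idea of summing word feature vectors and comparing the sums can be sketched with toy values. Real word vectors would be learned from a large corpus and have around 200 dimensions; the three-dimensional `embeddings` table below is invented purely for illustration. Cosine similarity is an assumed measure, and skipping unknown words is an assumed policy.

```python
import math
from typing import Dict, Iterable, List


def text_vector(words: Iterable[str], embeddings: Dict[str, List[float]], dim: int) -> List[float]:
    """Sum the feature vectors of the given words, skipping unknown words."""
    total = [0.0] * dim
    for word in words:
        vec = embeddings.get(word)
        if vec is None:
            continue  # assumption: words without a learned vector are ignored
        for i, value in enumerate(vec):
            total[i] += value
    return total


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for learned word feature vectors.
embeddings = {
    "pitcher": [1.0, 0.0, 0.0],
    "batter": [0.8, 0.2, 0.0],
    "election": [0.0, 0.0, 1.0],
    "vote": [0.0, 0.1, 0.9],
}

model_vector = text_vector(["pitcher", "batter"], embeddings, 3)
baseball_cluster = text_vector(["batter", "pitcher", "pitcher"], embeddings, 3)
politics_cluster = text_vector(["election", "vote"], embeddings, 3)

sim_baseball = cosine_similarity(model_vector, baseball_cluster)
sim_politics = cosine_similarity(model_vector, politics_cluster)
print(sim_baseball > sim_politics)  # True
```

Because the comparison works on vector directions, two texts can score as similar even when they share few identical words.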

これにより、単なる文書ＩＤで構成される数値のベクトルの類似度や、特徴語の類似度だけではなく、より意味的に類似した学習モデルを選択することが可能になるという効果が得られる。 This has the effect of making it possible to select learning models that are more semantically similar, rather than simply using the similarity of a numerical vector composed of document IDs or the similarity of feature words.

クラスタに含まれる文書一覧を最適に再ランク付けするための学習モデルを選択するための方法であれば、いかなる素性を利用しても良いことはいうまでもない。 It goes without saying that any feature can be used as long as it is a method for selecting a learning model that optimally reranks the list of documents contained in a cluster.

また、本実施例では文書を対象としたが、データとして検索、分類、評価が可能な画像等、様々な種類のデータにも適用可能である。 In addition, while this embodiment focuses on documents, it can also be applied to various types of data, such as images that can be searched, categorized, and evaluated as data.

なお、上述した各種データの構成及びその内容はこれに限定されるものではなく、用途や目的に応じて、様々な構成や内容で構成されることは言うまでもない。 It goes without saying that the structure and content of the various data described above are not limited to those described above, and may be structured and configured in various ways depending on the application and purpose.

以上、いくつかの実施形態について示したが、本発明は、例えば、システム、装置、方法、コンピュータプログラムもしくは記録媒体等としての実施態様をとることが可能であり、具体的には、複数の機器から構成されるシステムに適用しても良いし、また、一つの機器からなる装置に適用しても良い。 Although several embodiments have been described above, the present invention can be embodied, for example, as a system, device, method, computer program, or recording medium, and specifically, may be applied to a system made up of multiple devices, or may be applied to a device made up of a single device.

また、本発明におけるコンピュータプログラムは、図８、図１１に示すフローチャートの処理方法をコンピュータが実行可能なコンピュータプログラムであり、本発明の記憶媒体は図８、図１１の処理方法をコンピュータが実行可能なコンピュータプログラムが記憶されている。なお、本発明におけるコンピュータプログラムは図８、図１１の各装置の処理方法ごとのコンピュータプログラムであってもよい。 The computer program of the present invention is a computer program that enables a computer to execute the processing methods of the flowcharts shown in Figures 8 and 11, and the storage medium of the present invention stores a computer program that enables a computer to execute the processing methods of Figures 8 and 11. Note that the computer program of the present invention may be a computer program for each processing method of each device in Figures 8 and 11.

以上のように、前述した実施形態の機能を実現するコンピュータプログラムを記録した記録媒体を、システムあるいは装置に供給し、そのシステムあるいは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記録媒体に格納されたコンピュータプログラムを読出し実行することによっても、本発明の目的が達成されることは言うまでもない。 As described above, it goes without saying that the object of the present invention can be achieved by supplying a recording medium on which a computer program that realizes the functions of the above-mentioned embodiments is recorded to a system or device, and having the computer (or CPU or MPU) of that system or device read and execute the computer program stored on the recording medium.

この場合、記録媒体から読み出されたコンピュータプログラム自体が本発明の新規な機能を実現することになり、そのコンピュータプログラムを記憶した記録媒体は本発明を構成することになる。 In this case, the computer program read from the recording medium itself realizes the novel functions of the present invention, and the recording medium on which the computer program is stored constitutes the present invention.

コンピュータプログラムを供給するための記録媒体としては、例えば、フレキシブルディスク、ハードディスク、光ディスク、光磁気ディスク、ＣＤ－ＲＯＭ、ＣＤ－Ｒ、ＤＶＤ－ＲＯＭ、磁気テープ、不揮発性のメモリカード、ＲＯＭ、ＥＥＰＲＯＭ、シリコンディスク、ソリッドステートドライブ等を用いることができる。 Recording media for supplying computer programs include, for example, flexible disks, hard disks, optical disks, magneto-optical disks, CD-ROMs, CD-Rs, DVD-ROMs, magnetic tapes, non-volatile memory cards, ROMs, EEPROMs, silicon disks, solid-state drives, etc.

また、コンピュータが読み出したコンピュータプログラムを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのコンピュータプログラムの指示に基づき、コンピュータ上で稼働しているＯＳ（オペレーティングシステム）等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 Furthermore, it goes without saying that the functions of the above-mentioned embodiments are not only realized by the computer executing a computer program that has been read out, but also includes cases where an operating system (OS) running on the computer performs some or all of the actual processing based on the instructions of the computer program, thereby realizing the functions of the above-mentioned embodiments.

さらに、記録媒体から読み出されたコンピュータプログラムが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれた後、そのコンピュータプログラムコードの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵ等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 Furthermore, it goes without saying that this also includes cases where a computer program read from a recording medium is written into a memory provided on an expansion board inserted into a computer or a function expansion unit connected to a computer, and then a CPU or the like provided on the expansion board or function expansion unit performs some or all of the actual processing based on the instructions of the computer program code, thereby realizing the functions of the above-mentioned embodiments.

また、本発明は、複数の機器から構成されるシステムに適用しても、１つの機器からなる装置に適用してもよい。また、本発明は、システムあるいは装置にコンピュータプログラムを供給することによって達成される場合にも適応できることは言うまでもない。この場合、本発明を達成するためのコンピュータプログラムを格納した記録媒体を該システムあるいは装置に読み出すことによって、そのシステムあるいは装置が、本発明の効果を享受することが可能となる。 The present invention may be applied to a system consisting of multiple devices, or to an apparatus consisting of a single device. Needless to say, the present invention can also be applied to cases where the invention is achieved by supplying a computer program to a system or apparatus. In this case, the system or apparatus can enjoy the effects of the present invention by reading a recording medium containing a computer program for achieving the present invention into the system or apparatus.

さらに、本発明を達成するためのコンピュータプログラムをネットワーク上のサーバ、データベース等から通信プログラムによりダウンロードして読み出すことによって、そのシステムあるいは装置が、本発明の効果を享受することが可能となる。 Furthermore, by downloading and reading a computer program for achieving the present invention from a server, database, etc. on a network using a communication program, the system or device can enjoy the effects of the present invention.

なお、上述した各実施形態およびその変形例を組み合わせた構成も全て本発明に含まれるものである。 In addition, the present invention also includes configurations that combine the above-mentioned embodiments and their modified examples.

REFERENCE SIGNS LIST
１００情報処理装置 100 Information processing device
１０１検索条件受付部 101 Search condition acceptance unit
１０２類似検索部 102 Similarity search unit
１０３クラスタリング部 103 Clustering unit
１０４表示部 104 Display unit
１０５学習モデル選択部 105 Learning model selection unit
１０６再ランク付け部 106 Re-ranking unit
１０７学習データ登録部 107 Learning data registration unit
１０８学習モデル生成部 108 Learning model generation unit
１０９生成モデル決定部 109 Generation model determination unit
１２１文書記憶部 121 Document storage unit
１２２学習データ記憶部 122 Learning data storage unit
１２３学習モデル記憶部 123 Learning model storage unit

Claims

An information processing device for managing a learning model, comprising:
a classification means for classifying data used for learning into groups;
a determination means for determining, based on the data classified into the groups, feature data indicative of the data classified into the groups;
a combining means for combining the groups to create a new group based on the feature data indicating the data classified into the groups; and
a registration means for registering a learning model generated using the data classified into the combined group and the feature data related to the determined group.

The information processing device according to claim 1, characterized in that the feature data is data that characterizes the group.

The information processing device according to claim 1 or 2, characterized in that the feature data is used when the group is selected from a plurality of group candidates.

The information processing device according to any one of claims 1 to 3, characterized in that the learning model is a ranking learning model that ranks the data.

The information processing device according to any one of claims 1 to 4, characterized in that the feature data is data indicating an inclusion relationship between the group and data classified into the group.

The information processing device according to claim 1, wherein the feature data is a vector of values indicating whether each piece of data has been classified into the group.

The information processing device according to any one of claims 1 to 6, characterized in that the data is document data searched for by search text.

A method for controlling an information processing device that manages a learning model, comprising:
a classification step in which a classification means classifies data used for learning into groups;
a determination step in which a determination means determines, based on the data classified into the groups, feature data indicative of the data classified into the groups;
a combining step in which a combining means combines the groups based on the feature data indicating the data classified into the groups to create a new group; and
a registration step in which a registration means registers a learning model generated using the data classified into the combined group and the feature data related to the determined group.

A program executable by an information processing device that manages a learning model, the program causing the information processing device to function as:
a classification means for classifying data used for learning into groups;
a determination means for determining, based on the data classified into the groups, feature data indicative of the data classified into the groups;
a combining means for combining the groups to create a new group based on the feature data indicating the data classified into the groups; and
a registration means for registering a learning model generated using the data classified into the combined group and the feature data related to the determined group.