JP2020071678A

JP2020071678A - Information processing device, control method, and program

Info

Publication number: JP2020071678A
Application number: JP2018205385A
Authority: JP
Inventors: 大樹三浦; Hiroki Miura; 下郡山　敬己; Itsuki Shimokooriyama; 敬己下郡山
Original assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Current assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Priority date: 2018-10-31
Filing date: 2018-10-31
Publication date: 2020-05-07
Anticipated expiration: 2038-10-31
Also published as: JP7256357B2

Abstract

To provide a technique that, in rank learning for information retrieval, uses learning data effectively to increase the gain in accuracy and also shortens the learning time. SOLUTION: An information processing device includes: search means for searching for a search target document with a search query text; and storage means for storing the search target document and a learning search query text associated with the search target document. The device further includes: creation means for creating additional text information for the search target document using the learning search query text associated with that document; and learning means for performing rank learning using pairs of the learning search query text and document data that includes the additional text information for the search target document. SELECTED DRAWING: Figure 1

Description

The present invention relates to a document search technique for presenting, from a group of documents to be searched, the documents considered appropriate for a specified search condition.

Conventionally, in order to present appropriate search results to a user, there are techniques that calculate, as a statistical value, the relevance between a search condition and the terms (character strings cut out by a fixed criterion such as morphological analysis or N-grams) contained in each document of a document group. Such techniques are called similarity search, among other names. (Hereinafter, in the description of the present invention, they are uniformly called similarity search and are distinguished from the search by rank learning described later.)

There is also a rank learning technique that improves the accuracy of similarity search: the features observed when learning data and the documents to be searched are similar are modeled by machine learning, and when a new search condition is specified, the ranking is adjusted based on that learning model.

Rank learning requires a large amount of learning data, but collecting it is difficult. Learning data could be collected from users' search logs after the similarity search system has gone into operation, but evaluating search results places a burden on users, so it cannot be said for certain that a sufficient volume of logs can be collected. Before operation starts, the data is limited to items such as learning data that developers created for testing.

Patent Document 1 concerns a technique that, given answers prepared in advance (in effect, a group of FAQ documents), finds the question (a question sentence in the learning data) most similar to a user's inquiry and returns the corresponding answer. For that technique it provides a way to raise topic estimation accuracy even when there are few question sentences.

Specifically, words appearing in a question sentence of the learning data are replaced with words from the corresponding answer, which expands the question sentences, that is, increases the number of learning data items. To exclude unnatural sentences from the expanded set, a probabilistic language model is used to calculate the probability of each question sentence, and a sentence is used as learning data only when that probability exceeds a threshold.

Patent Document 1: JP 2017-37588 A

However, although the technique of Patent Document 1 uses a probabilistic language model to judge whether an expanded question sentence is appropriate, the substituted words come only from the answers prepared in advance, which may use technical terms or terms specific to a particular organization. In that case the probabilistic language model lacks examples, and the question sentences may not be expanded appropriately.

Even if the question sentences are expanded appropriately, actual users may never enter such technical or organization-specific terms. Moreover, the very purpose of that technique is to multiply the question sentences, so the time required for learning may increase exponentially.

An object of the present invention is to provide a technique that, in rank learning for information retrieval, uses learning data effectively to increase the gain in accuracy and makes it possible to shorten the learning time.

The present invention is an information processing apparatus comprising search means for searching for a search target document by a search query text, and storage means for storing the search target document and a learning search query text associated with the search target document, the apparatus being characterized by comprising: creation means for creating additional text information for the search target document using the learning search query text associated with the search target document; and learning means for performing rank learning using pairs of the learning search query text and document data that includes the additional text information for the search target document.

The present invention makes it possible to provide a technique that, in rank learning for information retrieval, uses learning data effectively to increase the gain in accuracy and makes it possible to shorten the learning time.

FIG. 1 is a diagram showing an example of a functional configuration according to an embodiment of the present invention.
FIG. 2 is a block diagram showing an example of a hardware configuration applicable to the information processing apparatus 100 according to the embodiment of the present invention.
FIG. 3 is an example of documents that are targets of similarity search according to the embodiment of the present invention.
FIG. 4 is an example of learning data according to the embodiment of the present invention.
FIG. 5 is an example of a generated feature vector according to the embodiment of the present invention.
FIG. 6 is an example of feature words extracted from learning data according to the embodiment of the present invention.
FIG. 7 is an example of a generated feature vector according to the embodiment of the present invention.
FIG. 8 is an example of a graph showing the distribution of the number of learning data items according to the embodiment of the present invention.
FIG. 9 is an example of a flowchart illustrating processing at the time of learning according to the embodiment of the present invention.
FIG. 10 is an example of a flowchart illustrating similarity search and re-ranking based on a learning result according to the embodiment of the present invention.
FIG. 11 is an example of setting items according to the embodiment of the present invention.
FIG. 12 is a diagram showing an example of a method of storing learning language information according to the embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

In the present invention, the ranking of search results obtained by a conventional document search is re-assigned using machine learning. This is called rank learning, among other names. For convenience of explanation, the process of determining a learning model in advance is called “learning model generation,” and the process of re-assigning the ranking of search results, obtained from an actual search condition of a user or the like, using the generated learning model is called “re-ranking.”

FIG. 1 is a diagram showing an example of a functional configuration according to the embodiment of the present invention. The configuration can be considered in three main parts. First, a conventional similarity search is used; the part related to it is 131. The part that generates a learning model from learning data is 101 to 104. The part that performs similarity search and re-ranking based on the generated learning model is 111 to 114.

The learning data preprocessing unit 101 is a functional unit that selects the learning data actually used for learning from the learning data stored in the learning data storage unit 121 (user logs such as search conditions and the selection of correct answers), and that extracts linguistic features from the learning data group. The criteria for this selection and extraction are stored in the setting storage unit 122. The preprocessing results are stored in the learning language information storage unit 123.

The information search unit 131 is called with the query described in the learning data and searches the search target document storage unit 124, which stores the document group. The information search here is assumed not to be a search engine whose accuracy has been raised by the “rank learning” described in the present invention, but it may be another rank learning method, the rank learning of the present invention itself, or any other method of information retrieval. It suffices that the documents a user wants can be retrieved appropriately from the search target document storage unit 124.

The learning feature vector generation unit 102 takes one item of the learning data, compares the query (search condition) in that item with each document in its search results, and expresses the linguistic features of each document as a “feature vector.” The learning feature vector mapping unit 103 then maps the feature vectors into a coordinate space of the corresponding dimension. Based on this mapping, the learning model generation unit 104 expresses the result of re-ranking (rank learning) as a learning model and stores it in the learning model storage unit 125.
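The learning-time flow through units 131 and 102 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the 1/0 relevance labels, and the toy FAQ text are all assumptions introduced here, and `toy_search` and `toy_featurize` merely stand in for the information search unit and the feature vector generation.

```python
def build_training_rows(examples, search, featurize):
    """Return (query, faq_id, feature_vector, label) rows for rank learning.

    examples  : iterable of (query, correct_faq_id) pairs (the learning data)
    search    : hypothetical callable, query -> list of document dicts
    featurize : hypothetical callable, (query, document) -> list of floats
    """
    rows = []
    for query, correct_id in examples:
        for doc in search(query):
            # Assumption: the FAQ recorded in the learning data is the only
            # correct answer; every other retrieved document is labelled 0.
            label = 1 if doc["faq_id"] == correct_id else 0
            rows.append((query, doc["faq_id"], featurize(query, doc), label))
    return rows


# Toy stand-ins (invented text) so that the sketch runs on its own.
FAQS = [
    {"faq_id": "3198", "question": "how to print a report"},
    {"faq_id": "0064", "question": "how to reset a password"},
]


def toy_search(query):
    # A real search unit would return only the ranked hits for the query.
    return FAQS


def toy_featurize(query, doc):
    # One feature: number of distinct words shared by query and question.
    shared = set(query.split()) & set(doc["question"].split())
    return [float(len(shared))]


rows = build_training_rows([("print report", "3198")], toy_search, toy_featurize)
```

Rows of this shape are what a rank learning algorithm would consume; which algorithm the learning model generation unit 104 uses is not specified in the text.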

Next, the configuration used in actual operation is described, in which a user or another application enters a search condition and an appropriate result is returned to the caller using the learning model obtained by learning in advance.

The user condition receiving unit 111 receives a search condition (query) from a user (or another application). Based on that query, the information search unit 131 searches the search target document storage unit 124 and returns search results. The re-ranking feature vector generation unit 112 compares the search condition with each search result and generates a feature vector. Generating a feature vector from one query and one document is the same process as in the learning feature vector generation unit 102, but the two are treated as separate functional units for convenience, because learning involves multiple queries and because the process may differ depending on the learning theory. Whether the two are in practice identical or different, both cases are included in the present invention.

The re-ranking feature vector mapping unit 113 maps the feature vector into the coordinate space generated by the learning model generation unit 104 and stored in the learning model storage unit 125. Based on this mapping, the re-ranking unit 114 re-ranks the documents in the search results obtained for the user's search condition.
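As a rough illustration of the re-ranking step, the sketch below scores each search result with a linear model and sorts by that score. The linear form, the `weights` values, and the function name `rerank` are assumptions made for this example; the text does not say what form the learning model takes.

```python
def rerank(results, weights):
    """Re-order search results by a learned score, highest first.

    results : list of (doc_id, feature_vector) in the original search order
    weights : one learned weight per feature (assumed linear model)
    """
    def score(vector):
        return sum(w * x for w, x in zip(weights, vector))

    # sorted() is stable, so documents with equal scores keep their
    # original search order.
    ranked = sorted(results, key=lambda item: score(item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked]
```

For example, with `weights = [0.2, 0.8]`, the results `[("A", [1.0, 0.0]), ("B", [0.0, 1.0]), ("C", [0.5, 0.5])]` would be re-ordered as B, C, A.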

FIG. 2 is a block diagram showing an example of a hardware configuration applicable to the information processing apparatus 100 according to the embodiment of the present invention.

As shown in FIG. 2, the information processing apparatus 100 has a configuration in which a CPU (Central Processing Unit) 201, a RAM (Random Access Memory) 202, a ROM (Read Only Memory) 203, an input controller 205, a video controller 206, a memory controller 207, a communication I/F controller 208, and the like are connected via a system bus 204.

The CPU 201 centrally controls the devices and controllers connected to the system bus 204.

The ROM 203 or the external memory 211 stores the BIOS (Basic Input/Output System) and OS (Operating System), which are control programs of the CPU 201, as well as the various programs, described later, needed to realize the functions executed by each server or PC. It also stores the information needed to carry out the present invention. The external memory may be a database.

The RAM 202 functions as the main memory, work area, and the like of the CPU 201. The CPU 201 realizes various operations by loading the programs needed for processing from the ROM 203 or the external memory 211 into the RAM 202 and executing them.

The input controller 205 controls input from a keyboard (KB) 209 and from a pointing device such as a mouse (not shown).

The video controller 206 controls display on a display device such as the display 210. The display device may be a liquid crystal display or the like. These are used by an administrator as needed.

The memory controller 207 controls access to the external memory 211, such as an external storage device (hard disk (HD)) that stores a boot program, various applications, font data, user files, edit files, and various data, a flexible disk (FD), or a CompactFlash (registered trademark) memory connected via an adapter to a PCMCIA (Personal Computer Memory Card International Association) card slot.

The communication I/F controller 208 connects to and communicates with external devices via a network and executes communication control processing on the network. For example, communication using TCP/IP (Transmission Control Protocol/Internet Protocol) is possible.

The CPU 201 enables display on the display 210 by, for example, rasterizing outline fonts into a display information area in the RAM 202. The CPU 201 also enables user instructions through a mouse cursor (not shown) or the like on the display 210.

The various programs, described later, for realizing the present invention are recorded in the external memory 211 and are executed by the CPU 201 after being loaded into the RAM 202 as needed.

FIG. 3 is an example of documents to be searched according to the embodiment of the present invention. The example shows a collection of FAQs (frequently asked questions) used for supporting a software product, but the documents to be selected in the present invention may be anything that contains text, such as papers, newspaper articles, company regulations, or posts on social networking services, and are not limited to FAQs.

In the examples 300a and 300b, one FAQ consists of fields such as “FAQ ID,” which uniquely identifies the FAQ; “category,” which classifies the content of the FAQ as a whole; “question,” which describes the support content and serves as a guide to which FAQ should be consulted for, say, an inquiry from a user of the software product; and “answer,” which gives the response. Each field contains the corresponding content. The FAQ ID associates the character string of an inquiry (query) with a specific FAQ in the learning data described later; the field name itself is not limited to this. The names and number of the other fields are likewise not limited.

FIG. 4 is an example of learning data according to the embodiment of the present invention. The example in FIG. 4 shows learning data 400a (10 items) corresponding to the FAQ 300a of FIG. 3 and learning data 400b (10 items) corresponding to the FAQ 300b.

Each learning data item has a query 401 (also called an inquiry or a search condition) and an FAQ association 402. The FAQ association 402 stores an FAQ ID, which identifies the FAQ of FIG. 3 to which the learning data item corresponds.
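The two record types described so far can be sketched as simple data classes. The class and attribute names are assumptions chosen for this illustration; only the field roles (FAQ ID, category, question, answer, query 401, FAQ association 402) come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Faq:
    faq_id: str     # uniquely identifies the FAQ
    category: str   # classifies the FAQ as a whole
    question: str
    answer: str


@dataclass(frozen=True)
class TrainingExample:
    query: str      # query 401: what the user typed
    faq_id: str     # FAQ association 402: the FAQ judged to match


def group_queries_by_faq(examples):
    """Collect the learning queries recorded for each FAQ ID."""
    grouped = defaultdict(list)
    for example in examples:
        grouped[example.faq_id].append(example.query)
    return dict(grouped)
```

Grouping the queries by FAQ ID is the natural starting point for any per-FAQ processing of the learning data, such as the feature word extraction described for FIG. 6.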

Such learning data can be obtained from logs collected when users actually use the similarity search engine of the present invention, for example by having a user select the FAQ that matches the intent of the inquiry (query) entered. Alternatively, rather than through a function of the present invention, the inquiry (query) and the corresponding FAQ may be identified and collected from sources such as e-mail exchanged between users and support staff.

FIG. 5 is an example of a generated feature vector according to the embodiment of the present invention. Similarity search in the present invention takes place both “at learning time” and “at inquiry time,” when a user or the like actually consults the FAQs. A feature vector is generated in both cases.

A feature vector is a sequence of numerical values (regarded as a vector) obtained by applying language processing to one of the documents (FAQs in this example) returned when the information search unit 131 searches the search target document storage unit 124 using the query 401 contained in the learning data as the search condition.

For each of the three fields of one FAQ (similarity calculation fields 501a), a similarity to the query 401 is calculated with each of the three similarity calculation methods listed under similarity index 502. For example, the value 1.2 of the item indicated by 503 was obtained by applying the similarity calculation method BM25 to the query 401 and the text in the question field.

500a lists nine numerical values in tabular form. Arranged in a single row, they form the feature vector (nine-dimensional in this example); below, the tabular form is also identified with the feature vector and is called feature vector 500a.
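The fields-by-measures layout of the feature vector can be sketched as below. The three measures here (shared-term count, Jaccard coefficient, cosine similarity over term counts) are simple stand-ins chosen for this example; they are not BM25 and are not necessarily the measures listed under similarity index 502. Whitespace tokenization is also an assumption; Japanese text would need morphological analysis or N-grams.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def overlap(q, d):
    """Number of distinct terms shared by query and field."""
    return float(len(set(q) & set(d)))


def jaccard(q, d):
    union = set(q) | set(d)
    return len(set(q) & set(d)) / len(union) if union else 0.0


def cosine(q, d):
    cq, cd = Counter(q), Counter(d)
    dot = sum(cq[t] * cd[t] for t in cq)
    norm = (math.sqrt(sum(v * v for v in cq.values()))
            * math.sqrt(sum(v * v for v in cd.values())))
    return dot / norm if norm else 0.0


MEASURES = (overlap, jaccard, cosine)


def feature_vector(query, document, fields):
    """Flatten the fields x measures table into one row, field by field."""
    q = tokenize(query)
    return [measure(q, tokenize(document.get(field, "")))
            for field in fields
            for measure in MEASURES]
```

With three fields and three measures the result has nine entries, matching the nine-dimensional example in the text; the first three entries belong to the first field, the next three to the second, and so on.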

The fields used for the calculation in this example are only examples. The FAQ ID has no linguistic meaning and so would usually not be used, but the other three fields need not all be used either. It suffices to use those in which linguistic features appear strongly.

The three methods listed under similarity index 502 are likewise only examples, and various other calculation methods exist. Their details are omitted because they are well-known techniques in natural language processing.

FIG. 6 is an example of feature words extracted from learning data according to the embodiment of the present invention. In the embodiment, “feature words” are extracted from the learning data 400. In feature word extraction, for example, the words extracted from a text A in a text group are statistically compared with the words extracted from the other texts in the group, and the words judged to characterize text A (to distinguish it from the other texts) are extracted as the feature words of text A. There are various feature word extraction methods; their details are omitted because they are well known in natural language processing.
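One common way to make the statistical comparison described above concrete is a TF-IDF-style weight, sketched below. This is an assumed stand-in, not the method used in the patent, which leaves the extraction method open. The function name and the whitespace tokenization are also assumptions; Japanese text would first need morphological analysis or N-gram segmentation.

```python
import math
from collections import Counter


def feature_word_scores(texts_by_id, target_id):
    """Score each word of one text against all texts in the group.

    texts_by_id : dict mapping an ID to its text, for example all learning
                  queries for one FAQ joined into a single string
    Returns a dict word -> importance; words that occur in every text get
    0.0 because they do not distinguish the target text from the others.
    """
    tokenized = {key: text.lower().split() for key, text in texts_by_id.items()}

    # Document frequency: in how many texts does each word occur?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    target = tokenized[target_id]
    term_freq = Counter(target)
    n_texts = len(tokenized)
    return {
        word: (count / len(target)) * math.log(n_texts / doc_freq[word])
        for word, count in term_freq.items()
    }
```

A word that is frequent in the target text and rare elsewhere receives a high score, which matches the idea of a word that distinguishes text A from the other texts.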

Feature words 600a are the top 10 feature words extracted from all learning data whose correct answer is FAQ ID = 3198 (learning data 400a in FIG. 4). In the feature word extraction process described above, each word is given an “importance.” The top 10 are the ten words of highest importance, selected because, for example, “feature word count” in the setting storage unit 122 of FIG. 11 is set to “10.” Alternatively, based on the “0.5” given for “feature word importance,” the words whose importance is 0.5 or higher may be selected regardless of their number.
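The two selection rules (a fixed count or an importance threshold) can be sketched with one helper. The function and parameter names are assumptions for this example, as is the option of applying both rules together; the text describes them as alternatives.

```python
def select_feature_words(scores, max_count=None, min_importance=None):
    """Pick feature words by importance.

    scores         : dict word -> importance
    max_count      : keep at most this many words (cf. "feature word count")
    min_importance : keep only words at or above this value
                     (cf. "feature word importance")
    """
    # Highest importance first; ties are broken alphabetically so that the
    # result is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    if min_importance is not None:
        ranked = [(w, s) for w, s in ranked if s >= min_importance]
    if max_count is not None:
        ranked = ranked[:max_count]
    return [word for word, _ in ranked]
```

Calling it with `max_count=10` corresponds to the top-10 selection in the example, and calling it with `min_importance=0.5` corresponds to the threshold alternative.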

Feature words 600b show, in the same way, the top 10 feature words extracted from all learning data whose correct answer is FAQ ID = 0064 (learning data 400b in FIG. 4).

FIG. 7 is an example of a generated feature vector according to the embodiment of the present invention.

Compared with the feature vector 500a described with FIG. 5, the feature vector 500b adds an item called the learning language information field 701. One example of learning language information is the words selected by the feature word extraction process from the learning data 400 of FIG. 4, that is, the feature words 600 of FIG. 6. The information stored as feature words 600 is not actually a field of the FAQ. As shown in FIG. 12, however, for the FAQ with FAQ ID = 3198, for example, it is stored as the learning language information 1202 associated with the entry 3198 of the index 1201; it is retrieved and treated as a logical field of that FAQ.

In other words, FIG. 5 used only three fields to generate the feature vector, whereas FIG. 7 also uses the logical learning language information field 701, which contains the feature words.
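The idea of a logical field that is stored separately and attached at lookup time can be sketched as follows. The dictionary layout and the field name `learning_language` are assumptions for this illustration; the text only specifies that the information is stored per FAQ ID and treated as if it were a field of that FAQ.

```python
def with_learning_language_field(faq, learning_language_info,
                                 field_name="learning_language"):
    """Return a copy of the FAQ with the logical field attached.

    faq                    : dict with at least a "faq_id" key
    learning_language_info : dict FAQ ID -> list of words (cf. index 1201
                             and learning language information 1202)
    """
    words = learning_language_info.get(faq["faq_id"], [])
    extended = dict(faq)          # the stored FAQ itself is left untouched
    extended[field_name] = " ".join(words)
    return extended
```

An FAQ without an entry in the store receives an empty field, which is the situation the text later addresses by adding the search score as an extra feature.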

As described above, the information contained in the learning language information field 701 can be obtained, for example, by collecting as logs what users actually enter, and can then be used for learning.

FAQ fields such as “question” and “answer” are text written in “correct language” by people familiar with the relevant business or technology, and they use technical terms.

Actual users, however, often do not know those technical terms and use their own words as search conditions. Important technical terms are therefore not necessarily effective in rank learning.

In the method of the present invention that uses the learning language information field 701, the users' search logs, that is, the users' own words, become the learning target, which yields more effective learning results.

Furthermore, for the feature vector 500c, the feature vector is generated from the learning language information field 701 alone (501c). This shortens the learning time.

In more detail, several factors determine the learning time; among them are the learning data and the dimension of the “feature vector” generated from it. For example, applying three feature calculation methods to the three FAQ fields “question,” “answer,” and “category” gives a feature vector of 3 × 3 = 9 dimensions. In contrast, restricting the similarity calculation fields to the single learning language information field, as in 500c, gives a 1 × 3 = 3-dimensional feature vector, which shortens the time required for rank learning.
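The dimension count above is simply the number of fields multiplied by the number of similarity measures, plus any extra features. The helper below restates that arithmetic; its name and the `n_extra` parameter are assumptions added for this example.

```python
def feature_dimension(n_fields, n_measures, n_extra=0):
    """Dimension of a feature vector built from fields x measures."""
    return n_fields * n_measures + n_extra


# Three FAQ fields, three measures: the 9-dimensional case in the text.
dim_three_fields = feature_dimension(3, 3)

# Learning language information field only: the 3-dimensional case.
dim_single_field = feature_dimension(1, 3)
```

If 500b keeps the three FAQ fields and adds the learning language information field, the same count would give 4 × 3 = 12 dimensions, although the text does not state that figure.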

The information in the learning language information field 701 is what is stored in the learning language information storage unit 123, but what that unit stores is not limited to feature words. In particular, the queries of all learning data may be stored in association with each FAQ ID. In that case the learning language information field 701 holds the full text of the learning data queries; because that text is analyzed by natural language processing during rank learning and linguistic features such as words are extracted, the method works in the same way. Nor is the stored information limited to these examples (feature words or full text); anything that linguistically expresses the characteristics of the learning data will do.

Using feature words does, however, shorten the time required for rank learning, as described above. Feature words carry less linguistic information than the full text of the learning data queries, so restricting the stored information to the extracted feature words shortens rank learning compared with storing the full query text.

Furthermore, learning data does not exist for every FAQ. The search results returned by the information search unit 131 during learning include inappropriate answers in addition to the FAQ that the user designated as the appropriate answer. An appropriate FAQ (the correct document) has at least one learning data item, but the other documents do not necessarily have corresponding learning data. In practice, the FAQs that users browse when handling inquiries are generally skewed toward a subset, and a relatively large proportion of FAQs are never browsed at all. For this reason, many FAQs have no corresponding learning data.

When no learning data exists for a document, restricting the learning target to the learning language information field 701 alone makes it impossible to generate a feature vector. To handle this case, for example, the score obtained when the information search unit 131 searches with the learning data query is added as one of the features, which makes it possible to generate a feature vector in which at least one feature is nonzero.
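A minimal sketch of this fallback, assuming the field-based features are simply absent (`None`) for a document without learning language information and that the search score is appended as the last element. The function name and the choice of three field features are hypothetical.

```python
from typing import List, Optional

def build_features(field_features: Optional[List[float]],
                   search_score: float,
                   n_field_features: int = 3) -> List[float]:
    """Append the base search score to the field-derived features.

    If the document has no learning language information, the field
    features are filled with zeros, but the search score still makes
    the vector non-trivial.
    """
    if field_features is None:
        field_features = [0.0] * n_field_features
    return list(field_features) + [search_score]

with_info = build_features([2.0, 0.5, 0.67], search_score=7.3)
without_info = build_features(None, search_score=4.1)
print(with_info)     # [2.0, 0.5, 0.67, 7.3]
print(without_info)  # [0.0, 0.0, 0.0, 4.1]
```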

As an example, this score is shown as Score 702 only in 500c, but it may of course also be added to 500a and 500b.

FIG. 8 is an example of a graph showing the distribution of the number of learning data items according to the embodiment of the present invention. The preceding paragraphs described the processing by which the learning data preprocessing unit 101 extracts feature words. This graph is used to explain another method, in which the learning data preprocessing unit 101 narrows down the learning data used for learning.

Rank learning, in the first place, exploits the similarity in linguistic features between the learning data used by the learning-related functional units 102 to 104 and the user conditions that are accepted by the user condition acceptance unit 111 during actual operation and used by the reranking functional units 112 to 114; it generates a learning model on that basis and improves search accuracy.

The learning data, however, is obtained from the users' search logs, that is, from a user actually entering a search condition and then selecting an appropriate answer from the search results. The user does not always select an appropriate answer.

For example, a user may mistakenly mark a browsed answer as appropriate when it is not, or may designate an answer unrelated to the original intent as appropriate simply because its content happened to be interesting. Learning that also uses such inappropriate learning data cannot yield an optimal learning model. It is therefore necessary to separate appropriate learning data from inappropriate learning data.

<First embodiment for classifying learning data>
First, a method of selecting appropriate learning data according to the number of learning data items that have each FAQ as the correct answer is presented. Based on the value 10 given for the learning execution count in the setting parameters 1100 (FIG. 11), the learning data for an FAQ is treated as appropriate learning data and used for rank learning only when 10 or more learning data items correspond to that FAQ.
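The count-based selection can be sketched as follows. The representation of a learning data item as a (query, correct FAQ ID) pair and the function name `select_by_count` are assumptions for illustration; the threshold of 10 follows the learning execution count mentioned in the text.

```python
from collections import Counter
from typing import List, Tuple

LearningItem = Tuple[str, str]  # (query text, FAQ ID of the correct answer)

def select_by_count(items: List[LearningItem], min_count: int = 10) -> List[LearningItem]:
    """Keep only items whose correct FAQ has at least min_count items."""
    counts = Counter(faq_id for _, faq_id in items)
    return [(query, faq_id) for query, faq_id in items if counts[faq_id] >= min_count]

# Toy data: 12 items for FAQ 3198, 3 items for a second FAQ.
items = [(f"query {i}", "3198") for i in range(12)] + \
        [(f"other {i}", "0100") for i in range(3)]
selected = select_by_count(items, min_count=10)
print(len(selected))  # 12
```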

The learning data count graph 800 shows the number of learning data items that have each FAQ as the correct answer. The horizontal axis is the FAQ ID and the vertical axis is the number of corresponding learning data items. The FAQs are arranged from the left in descending order of that number.

As described above, learning data is generally concentrated on particular FAQs. The FAQ IDs in range 804, that is, those to the left of the dotted line 802, have 10 or more corresponding learning data items. In this example, the FAQ with the most learning data has 40 items; FAQ 300a (FAQ ID 3198), given as an example in FIG. 3, has 30 or more; and FAQ 300b (FAQ ID 0064) has somewhat more than 10.

On the other hand, the FAQ IDs in range 805, that is, those between the dotted lines 802 and 803, have fewer than 10 learning data items, and range 806 (to the right of the dotted line 803) corresponds to FAQs that have no learning data at all.

As described above, some learning data is inappropriate, and this is assumed to occur rarely, with a certain probability. Accordingly, when only one or a few learning data items correspond to a particular FAQ, it is unlikely that any of them is inappropriate; on the other hand, if even one is inappropriate, its adverse effect on learning is large. When feature words are extracted as described above, many inappropriate feature words end up being selected.

Conversely, when several dozen learning data items correspond to an FAQ, a single inappropriate item has almost no adverse effect, because the linguistic features are extracted statistically from all of those items.

In other words, even if inappropriate learning data occurs with a certain low probability, the more learning data corresponds to a particular FAQ, the more safely it can be ignored, and the less learning data there is, the less it can be ignored. From this viewpoint, deciding for example not to use the learning data of any FAQ that has fewer than 10 corresponding items removes a cause of adverse effects on learning.

In the example of graph 800, the FAQs that are actually queried frequently are concentrated in range 804. Performing rank learning on this part with a large amount of learning data and raising its accuracy therefore returns appropriate results to the user; conversely, even if no learning at all is performed for ranges 805 and 806, which are rarely queried, the probability that the user notices a problem is low.

以上の方法でユーザの実際の使用頻度に応じて最適な学習を実施可能となる効果を得る。 With the method described above, it is possible to obtain the effect that the optimum learning can be performed according to the actual usage frequency of the user.

<Second embodiment for classifying learning data>
Another method is described. If the query contained in a learning data item is appropriate, the correct FAQ ranks relatively high even without any learning (that is, even in a search by the information search unit 131). Therefore, with rank 50 as a threshold for example, a learning data item is regarded as good learning data if a search with its query places the associated correct FAQ within the top 50.
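A minimal sketch of the rank-threshold check, assuming the base search returns FAQ IDs as a list ordered from best to worst. The function names are hypothetical; the threshold of 50 is the example value from the text.

```python
from typing import List, Optional

def rank_of(correct_id: str, ranked_ids: List[str]) -> Optional[int]:
    """Return the 1-based rank of correct_id, or None if it was not hit."""
    try:
        return ranked_ids.index(correct_id) + 1
    except ValueError:
        return None

def is_good_by_rank(correct_id: str, ranked_ids: List[str], max_rank: int = 50) -> bool:
    """A learning item is 'good' if its correct FAQ appears within max_rank."""
    rank = rank_of(correct_id, ranked_ids)
    return rank is not None and rank <= max_rank

# Toy result list: IDs "0001" ... "0100" in ranked order.
results = [f"{i:04d}" for i in range(1, 101)]
print(is_good_by_rank("0050", results))  # True
print(is_good_by_rank("0051", results))  # False
```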

<Third embodiment for classifying learning data>
As a method similar to the second embodiment, the threshold is applied not to the rank but to the similarity (search score) between the query and the FAQ in the search results. That is, a learning data item is regarded as good learning data if the similarity is at or above a certain value.

<Fourth embodiment for classifying learning data>
As a fourth embodiment, the second and third embodiments may be combined so that thresholds are applied to both the rank and the similarity. Needless to say, any other method may be used as long as it can classify the learning data, for example by using numerical information related to the similarity between the learning data and the correct answer in the search results, linguistic information such as the words they contain, or statistics obtained from other learning data and correct FAQs rather than from a single learning data item.
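The score threshold and the combined rank-and-score threshold can be sketched with one function in which either threshold is optional. The hit representation as (FAQ ID, search score) pairs sorted by descending score, the function name, and the sample scores are assumptions for illustration.

```python
from typing import List, Optional, Tuple

Hit = Tuple[str, float]  # (FAQ ID, search score), sorted by descending score

def is_good(correct_id: str, hits: List[Hit],
            max_rank: Optional[int] = None,
            min_score: Optional[float] = None) -> bool:
    """Apply a rank threshold, a score threshold, or both to the correct FAQ."""
    for rank, (faq_id, score) in enumerate(hits, start=1):
        if faq_id == correct_id:
            rank_ok = max_rank is None or rank <= max_rank
            score_ok = min_score is None or score >= min_score
            return rank_ok and score_ok
    return False  # the correct FAQ was not hit at all

hits = [("0064", 9.1), ("3198", 6.4), ("0100", 2.0)]
print(is_good("3198", hits, min_score=5.0))               # True  (score only)
print(is_good("3198", hits, max_rank=50, min_score=7.0))  # False (score too low)
```

Passing only `max_rank` reproduces the rank-based check, so the three threshold variants differ only in which arguments are supplied.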

図９は、本発明の実施形態に係る学習時の処理、すなわち学習モデルの生成を説明するフローチャートの一例である。 FIG. 9 is an example of a flowchart for explaining the processing at the time of learning according to the embodiment of the present invention, that is, the generation of the learning model.

ステップＳ９０１においては、学習データ記憶部１２１に記憶されている学習データを読み出す。 In step S901, the learning data stored in the learning data storage unit 121 is read.

ステップＳ９０２においては、前記学習データに対して、順位学習に用いる情報を抽出し学習言語情報記憶部１２３に格納する。例えば、例えば図６で示した特徴語を抽出し、図１２のように学習データに対応するＦＡＱに紐付けて格納する。その際、図８で説明したように処理対象とする学習データをあらかじめ選択してもよい。また他の例として前述の通り、各ＦＡＱＩＤに対応づけて学習データのクエリをそのまま格納してもよい。すなわち、実質的に抽出、選択などの処理を行わず、格納だけを行ってもよい。 In step S902, information used for rank learning is extracted from the learning data and stored in the learning language information storage unit 123. For example, the characteristic word shown in FIG. 6 is extracted and stored in association with the FAQ corresponding to the learning data as shown in FIG. At that time, the learning data to be processed may be selected in advance as described in FIG. As another example, as described above, the query of the learning data may be stored as it is in association with each FAQ ID. That is, it is possible to store only without substantially performing processes such as extraction and selection.

Steps S903 to S911 are iterative processing over all the learning data read in step S901 or, if part of the learning data was selected in step S902, over all the selected learning data.

ステップＳ９０４においては、前記の学習データの１つに着目し、ステップＳ９０５においては、情報検索部１３１が、当該学習データのクエリにより検索対象文書記憶部１２４を検索する。ここでは、前記クエリに対して１つまたは複数の文書がヒットする。ヒットする文書がない場合もあるが、その場合は以下の処理を中断し繰り返し処理にて次の学習データに着目する。 In step S904, attention is paid to one of the learning data, and in step S905, the information search unit 131 searches the search target document storage unit 124 by the query of the learning data. Here, one or more documents are hit for the query. In some cases, there is no hit document, but in that case, the following processing is interrupted and the next learning data is focused on by repeating the processing.

ステップＳ９０６からステップＳ９１０は、前記着目中のクエリに検索ヒットした文書に対する繰り返し処理である。 Steps S906 to S910 are an iterative process for the document hit by the query of interest.

ステップＳ９０７において、前記着目中のクエリでヒットした文書のうちの１つに着目する。 In step S907, attention is paid to one of the documents hit by the query in question.

ステップＳ９０８においては、着目中の文書に対応する学習言語情報、例えば着目中の文書がＦＡＱＩＤ＝３１９８であれば、図１２に示す当該文書（のＦＡＱＩＤ）に対応する特徴語を取得する。 In step S908, learning language information corresponding to the document of interest, for example, if the document of interest has FAQ ID = 3198, the characteristic word corresponding to (the FAQ ID of) the document shown in FIG. 12 is acquired.

In step S909, features are calculated from the query of interest and the fields of the document of interest that are subject to learning. Here, the feature words acquired in step S908 are treated as the (logical) learning language information field 701, which is also subject to learning. Calculating over a plurality of calculation methods and fields generates the feature vectors illustrated as 500b and 500c in FIG. 7. The generated feature vector is then mapped into the coordinate space. Because the learning data records whether the document of interest is the correct answer to the query of interest, a label indicating whether it is the correct answer is stored in association with the mapped coordinates. When the feature vectors for all learning data and all hit documents have been mapped into the coordinate space, the process proceeds to step S912.
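The nested loops of steps S903 to S911 can be sketched as below. The search function, the feature function, and the toy documents are hypothetical stand-ins passed in as parameters; only the loop structure and the labeling follow the flowchart description.

```python
from typing import Callable, Dict, List, Tuple

LabeledVector = Tuple[List[float], bool]  # (feature vector, is correct answer)

def collect_training_vectors(
    learning_data: List[Tuple[str, str]],           # (query, correct FAQ ID)
    search: Callable[[str], List[str]],             # stand-in for the information search unit
    learning_info: Dict[str, str],                  # FAQ ID -> learning language information
    featurize: Callable[[str, str], List[float]],   # stand-in for feature calculation
) -> List[LabeledVector]:
    rows: List[LabeledVector] = []
    for query, correct_id in learning_data:         # S903-S911: loop over learning data
        hits = search(query)                        # S905: base search
        if not hits:
            continue                                # no hit: move on to the next item
        for faq_id in hits:                         # S906-S910: loop over hit documents
            info = learning_info.get(faq_id, "")    # S908: learning language information
            vec = featurize(query, info)            # S909: feature vector
            rows.append((vec, faq_id == correct_id))  # label: correct answer or not
    return rows

# Toy stand-ins for illustration only.
docs = {"3198": "printer paper jam", "0064": "printer driver install"}
info = {"3198": "paper jam tray"}

def toy_search(query: str) -> List[str]:
    q = set(query.split())
    return [fid for fid, text in docs.items() if q & set(text.split())]

def toy_featurize(query: str, text: str) -> List[float]:
    return [float(len(set(query.split()) & set(text.split())))]

rows = collect_training_vectors(
    [("printer paper jam", "3198"), ("zzz", "0064")], toy_search, info, toy_featurize)
print(rows)  # [([2.0], True), ([0.0], False)]
```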

ステップＳ９１２においては、写像された全ての素性ベクトルに基づき適切な学習モデルを生成し、当該学習モデルを学習モデル記憶部１２５に格納する。学習モデルの生成については、各種方法が提示されており、周知の技術であるので詳細の説明は割愛する。 In step S912, an appropriate learning model is generated based on all the mapped feature vectors, and the learning model is stored in the learning model storage unit 125. Regarding the generation of the learning model, various methods have been presented and are well-known techniques, so a detailed description thereof will be omitted.
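The text leaves the model generation method open. Purely as an illustration of one well-known family of approaches, and not as the method of this document, the sketch below trains a linear pairwise ranker with perceptron-style updates: for each query, the weight vector is adjusted until every correct document scores higher than every incorrect one. All names and the toy data are assumptions.

```python
from typing import List, Tuple

Vector = List[float]
Group = List[Tuple[Vector, bool]]  # labeled feature vectors for one query

def dot(w: Vector, x: Vector) -> float:
    return sum(a * b for a, b in zip(w, x))

def train_pairwise(groups: List[Group], dim: int,
                   epochs: int = 10, lr: float = 1.0) -> Vector:
    """Pairwise perceptron: update w whenever a correct document
    does not outscore an incorrect one for the same query."""
    w = [0.0] * dim
    for _ in range(epochs):
        for group in groups:
            positives = [v for v, label in group if label]
            negatives = [v for v, label in group if not label]
            for p in positives:
                for n in negatives:
                    if dot(w, p) <= dot(w, n):
                        w = [wi + lr * (pi - ni) for wi, pi, ni in zip(w, p, n)]
    return w

# Toy groups: the first feature favors the correct document.
groups = [
    [([2.0, 0.1], True), ([0.0, 0.9], False)],
    [([1.0, 0.2], True), ([0.5, 0.8], False)],
]
w = train_pairwise(groups, dim=2)
```

In practice a library implementation of a ranking learner would normally replace this loop; the sketch only shows how labeled feature vectors grouped by query feed into a model.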

以上で、学習データと検索対象文書を用いて学習モデルを生成する処理を完了する。これにて図９のフローチャートの説明を完了する。 This completes the process of generating the learning model using the learning data and the search target document. This completes the description of the flowchart of FIG.

図１０は、本発明の実施形態に係る学習結果に基づく情報検索・再ランク付けの処理を説明するフローチャートの一例である。 FIG. 10 is an example of a flowchart illustrating the information search / reranking process based on the learning result according to the embodiment of the present invention.

ステップＳ１００１においては、ユーザ条件受付部１１１が、ユーザ（あるいは他のアプリケーション）の検索条件（クエリ）を受け付ける。 In step S1001, the user condition receiving unit 111 receives the search condition (query) of the user (or another application).

ステップＳ１００２においては、情報検索部１３１が前記受け付けたクエリで、検索対象文書記憶部１２４を検索し、ヒットした文書（ＦＡＱなど）を返す。 In step S1002, the information search unit 131 searches the search target document storage unit 124 with the received query and returns the hit document (FAQ or the like).

ステップＳ１００３からステップＳ１００７は、前記ヒットした文書全てに対する繰り返し処理である。 Steps S1003 to S1007 are repetitive processing for all the hit documents.

ステップＳ１００４においては、前記検索結果の文書の１つに着目する。 In step S1004, attention is paid to one of the documents as the search result.

ステップＳ１００５においては、着目した文書に対応する学習言語情報を取得する。たとえば着目中の文書が図３のＦＡＱ３００ａであれば、図１２の学習言語情報記憶部１２３の索引１２０１内の３１９８に対応する学習言語情報１２０２を取得する。例として、１０個の特徴語１２０３が記載されている。 In step S1005, learning language information corresponding to the document of interest is acquired. For example, if the document of interest is the FAQ 300a in FIG. 3, the learning language information 1202 corresponding to 3198 in the index 1201 of the learning language information storage unit 123 in FIG. 12 is acquired. As an example, 10 characteristic words 1203 are described.

ステップＳ１００６においては、ユーザ条件受付部１１１で受け付けたクエリと、論理的な学習言語情報フィールド７０１を含むＦＡＱ３００ａとから得られる素性ベクトルを生成する。例えば図７の５００ｂ、５００ｃのような素性ベクトルとなる。 In step S1006, a feature vector obtained from the query accepted by the user condition accepting unit 111 and the FAQ 300a including the logical learning language information field 701 is generated. For example, feature vectors such as 500b and 500c in FIG. 7 are obtained.

Further, the feature vector is mapped into the coordinate system containing the learning model described with reference to FIG. 9, which is read from the learning model storage unit 125 (the learning model need only be read once, for example at the start of the flowchart of FIG. 10).

As a result of the iterative processing, as many feature vectors are mapped as there are documents hit by the user's search condition (query). This completes the iterative processing of steps S1003 to S1007, and the process proceeds to step S1008. Unlike the case described with the flowchart of FIG. 9, the correct FAQ for this query is not known, so no label indicating whether a document is the correct answer is attached.

ステップＳ１００８においては、ユーザの検索条件（クエリ）に対してヒットした全ての文書に対応する素性ベクトルの写像は、図９の処理で生成された学習モデルと比較され、順位付けが成される。この順位付けについては、学習モデルの生成方法との対応で周知の技術であるため、詳細の説明は割愛する。 In step S1008, the mapping of the feature vectors corresponding to all the documents hit with respect to the user's search condition (query) is compared with the learning model generated in the process of FIG. 9 to rank the maps. Since this ranking is a well-known technique in correspondence with the method of generating the learning model, detailed description thereof will be omitted.
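As a hedged illustration of the ranking step, the sketch below assumes the learning model is a linear weight vector and scores each hit document by the dot product with its feature vector. The actual form of the model depends on the chosen learning method, which the text does not fix; the weights and feature values here are made up for the example.

```python
from typing import List, Tuple

def rerank(hits: List[Tuple[str, List[float]]], weights: List[float]) -> List[str]:
    """Order hit documents by model score, highest first.

    hits: (FAQ ID, feature vector) for every document hit by the query.
    weights: hypothetical linear model read once before the loop.
    """
    scored = [(sum(w * x for w, x in zip(weights, vec)), faq_id)
              for faq_id, vec in hits]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [faq_id for _, faq_id in scored]

hits = [("0064", [0.0, 0.9]), ("3198", [2.0, 0.1])]
print(rerank(hits, [2.0, -0.8]))  # ['3198', '0064']
```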

以上の処理で順位付けされた文書（クエリにヒットした前記文書）が、検索処理の呼び出し側に提示される。以上で、図１０を用いたフローチャートの説明を完了する。 The documents ranked in the above process (the documents that hit the query) are presented to the caller of the search process. This is the end of the description of the flowchart using FIG.

なお、上述した各種データの構成及びその内容はこれに限定されるものではなく、用途や目的に応じて、様々な構成や内容で構成されることは言うまでもない。 It is needless to say that the structure and contents of the various data described above are not limited to this, and may be composed of various structures and contents according to the use and purpose.

以上、いくつかの実施形態について示したが、本発明は、例えば、システム、装置、方法、コンピュータプログラムもしくは記録媒体等としての実施態様をとることが可能であり、具体的には、複数の機器から構成されるシステムに適用しても良いし、また、一つの機器からなる装置に適用しても良い。 Although some embodiments have been described above, the present invention can be embodied as, for example, a system, an apparatus, a method, a computer program, a recording medium, or the like, and more specifically, a plurality of devices. The present invention may be applied to a system composed of, or may be applied to a device composed of one device.

The computer program according to the present invention is a computer program that enables a computer to execute the processing methods of the flowcharts shown in FIGS. 9 and 10, and the storage medium of the present invention stores a computer program that enables a computer to execute the processing methods of FIGS. 9 and 10. The computer program according to the present invention may be a separate computer program for each processing method of each apparatus in FIGS. 9 and 10.

As described above, it goes without saying that the object of the present invention is also achieved by supplying a recording medium on which a computer program realizing the functions of the above-described embodiments is recorded to a system or apparatus, and by having the computer (or CPU or MPU) of that system or apparatus read and execute the computer program stored on the recording medium.

この場合、記録媒体から読み出されたコンピュータプログラム自体が本発明の新規な機能を実現することになり、そのコンピュータプログラムを記憶した記録媒体は本発明を構成することになる。 In this case, the computer program itself read from the recording medium realizes the novel function of the present invention, and the recording medium storing the computer program constitutes the present invention.

コンピュータプログラムを供給するための記録媒体としては、例えば、フレキシブルディスク、ハードディスク、光ディスク、光磁気ディスク、ＣＤ−ＲＯＭ、ＣＤ−Ｒ、ＤＶＤ−ＲＯＭ、磁気テープ、不揮発性のメモリカード、ＲＯＭ、ＥＥＰＲＯＭ、シリコンディスク、ソリッドステートドライブ等を用いることができる。 As a recording medium for supplying the computer program, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a DVD-ROM, a magnetic tape, a non-volatile memory card, a ROM, an EEPROM, A silicon disk, a solid state drive, etc. can be used.

また、コンピュータが読み出したコンピュータプログラムを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのコンピュータプログラムの指示に基づき、コンピュータ上で稼働しているＯＳ（オペレーティングシステム）等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 Further, by executing the computer program read by the computer, not only the functions of the above-described embodiments are realized, but also an OS (operating system) or the like running on the computer is executed based on the instruction of the computer program. It goes without saying that a case where some or all of the actual processing is performed and the functions of the above-described embodiments are realized by the processing is also included.

さらに、記録媒体から読み出されたコンピュータプログラムが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれた後、そのコンピュータプログラムコードの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵ等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 Furthermore, after the computer program read from the recording medium is written in the memory provided in the function extension board inserted in the computer or the function extension unit connected to the computer, the function is executed based on the instruction of the computer program code. It goes without saying that a case where the CPU or the like provided in the expansion board or the function expansion unit performs a part or all of the actual processing and the processing realizes the functions of the above-described embodiments is also included.

また、本発明は、複数の機器から構成されるシステムに適用しても、１つの機器からなる装置に適用してもよい。また、本発明は、システムあるいは装置にコンピュータプログラムを供給することによって達成される場合にも適応できることは言うまでもない。この場合、本発明を達成するためのコンピュータプログラムを格納した記録媒体を該システムあるいは装置に読み出すことによって、そのシステムあるいは装置が、本発明の効果を享受することが可能となる。 Further, the present invention may be applied to a system including a plurality of devices or an apparatus including one device. Further, it goes without saying that the present invention can be applied to the case where it is achieved by supplying a computer program to a system or an apparatus. In this case, by reading the recording medium storing the computer program for achieving the present invention into the system or apparatus, the system or apparatus can enjoy the effects of the present invention.

さらに、本発明を達成するためのコンピュータプログラムをネットワーク上のサーバ、データベース等から通信プログラムによりダウンロードして読み出すことによって、そのシステムあるいは装置が、本発明の効果を享受することが可能となる。 Further, by downloading and reading a computer program for achieving the present invention from a server, a database or the like on a network using a communication program, the system or apparatus can enjoy the effects of the present invention.

なお、上述した各実施形態およびその変形例を組み合わせた構成も全て本発明に含まれるものである。 It should be noted that the present invention also includes all configurations that combine the above-described embodiments and modifications thereof.

100 Information processing device
101 Learning data preprocessing unit
102 Learning-time search unit
103 Information search unit
104 Learning feature vector generation unit
105 Learning feature vector mapping unit
106 Learning model generation unit
111 User condition acceptance unit
112 User condition search unit
113 Reranking feature vector generation unit
114 Reranking feature vector mapping unit
115 Reranking unit
121 Learning data storage unit
122 Setting storage unit
123 Learning language information storage unit
124 Search target document storage unit
125 Learning model storage unit

Claims

An information processing apparatus comprising: a search unit that searches a search target document with a search query text; and a storage unit that stores the search target document and a learning search query text associated with the search target document, the information processing apparatus further comprising:
a creating unit that creates additional text information for the search target document using the learning search query text associated with the search target document; and
a learning unit that performs rank learning using a set of the learning search query text and document data including the additional text information for the search target document.

The information processing apparatus according to claim 1, wherein the learning unit performs rank learning using a set of the learning search query text and document data including only the additional text information for the search target document.

The information processing apparatus according to claim 1, wherein the learning unit performs rank learning using feature data calculated from the learning search query text and the document data.

The information processing apparatus according to claim 3, wherein the feature data includes a search score when the search target document corresponding to the document data is searched by the learning search query text.

The information processing apparatus according to claim 1, wherein the additional text information includes a characteristic word extracted from the learning search query text.

The information processing apparatus according to claim 5, wherein the number of characteristic words included in the additional text information is limited according to a predetermined value.

The information processing apparatus according to claim 1, wherein the learning unit includes the additional text information for the search target document in the document data when the number of learning search query texts associated with the search target document exceeds a predetermined value.

A control method for an information processing apparatus comprising: a search unit that searches a search target document with a search query text; and a storage unit that stores the search target document and a learning search query text associated with the search target document, the control method comprising:
a creating step in which a creating unit creates additional text information for the search target document using the learning search query text associated with the search target document; and
a learning step in which a learning unit performs rank learning using a set of the learning search query text and document data including the additional text information for the search target document.

A program executable in an information processing apparatus comprising: a search unit that searches a search target document with a search query text; and a storage unit that stores the search target document and a learning search query text associated with the search target document, the program causing the information processing apparatus to function as:
a creating unit that creates additional text information for the search target document using the learning search query text associated with the search target document; and
a learning unit that performs rank learning using a set of the learning search query text and document data including the additional text information for the search target document.