JP7012811B1 - Search device, search method, and program - Google Patents

Search device, search method, and program

Info

Publication number
JP7012811B1
Authority
JP
Japan
Prior art keywords
matrix
word
words
topic
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
JP2020205794A
Other languages
Japanese (ja)
Other versions
JP2022092849A (en)
Inventor
徳章 川前
Original Assignee
エヌ・ティ・ティ・コムウェア株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by エヌ・ティ・ティ・コムウェア株式会社
Priority to JP2020205794A
Application granted granted Critical
Publication of JP7012811B1
Publication of JP2022092849A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PROBLEM TO BE SOLVED: To improve the efficiency of searching for answers to questions.
SOLUTION: A search system 1 includes a learning unit 10 and a search unit 20. The learning unit 10 takes a group of texts to be processed as input and simultaneously learns a topic model, obtained by factorizing by NMF, with topics as the basis, the matrices D and S representing the importance of the words and parts of speech of each text, and a skip-gram model based on the co-occurrence of words in the texts, thereby generating a model that outputs a distributed vector when a document or word is input. The search unit 20 inputs a word or document into the model and outputs, as search results, words or documents having distributed vectors close to the obtained distributed vector.
[Selected drawing] FIG. 1

Description

The present invention relates to a search device, a search method, and a program.

Existing search systems are an extension of database search and retrieve items that match keywords. For example, in a search method that uses distributed embedding representations (distributed vectors), each word or document is represented by a distributed vector, and the search is executed using vectors learned so that closeness between vectors corresponds to closeness in meaning. This makes it possible to search for words or documents that are semantically close to a given word or document by vector operations.
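As a minimal illustration of this kind of search (not code from the patent), the following sketch assumes pre-learned embeddings held as a dict mapping words to NumPy arrays:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Closeness of vectors stands in for closeness of meaning.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_words(query_vec: np.ndarray, embeddings: dict, n: int = 5):
    # Rank all vocabulary words by cosine similarity to the query vector.
    scored = [(w, cosine(query_vec, vec)) for w, vec in embeddings.items()]
    return sorted(scored, key=lambda x: -x[1])[:n]
```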

International Publication No. WO 2018/151125

However, because existing distributed vectors are learned using only the information from a word's surrounding context, even antonyms may end up with similar distributed vectors. In addition, although a document is composed of words and not all words carry the same weight, the document's distributed vector is not learned with these weights taken into account, so the accuracy of document retrieval suffers.

The present invention has been made in view of the above, and an object of the present invention is to improve the efficiency of searching for answers to questions.

A search device according to one aspect of the present invention includes: a learning unit that takes a group of texts to be processed as input and simultaneously learns a first model, in which a matrix representing the importance of the words in each text of the text group is factorized by non-negative matrix factorization with topics as the basis, and a second model based on the co-occurrence of the words in the text group, thereby generating a model that outputs a distributed vector when a document or word is input; and a search unit that inputs a word or document into the model and outputs, as search results, words or documents having distributed vectors close to the obtained distributed vector.

According to the present invention, it is possible to improve the efficiency of searching for answers to questions.

FIG. 1 is a diagram showing an example of the configuration of the search system of the present embodiment.
FIG. 2 is a diagram expressing the model of the present embodiment by non-negative matrix factorization.
FIG. 3 is a diagram expressing the model of the present embodiment by skip-gram.
FIG. 4 is a flowchart showing an example of the flow of the learning process of the search system of the present embodiment.
FIG. 5 is a flowchart showing an example of the flow of the search process of the search system of the present embodiment.

[System configuration]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 is a diagram showing an example of the configuration of the search system of the present embodiment. The search system 1 shown in the figure includes a learning unit 10, a search unit 20, a data storage unit 30, and a calculation result storage unit 40. Each unit of the search system 1 may be implemented on a computer equipped with an arithmetic processing unit, a storage device, and the like, with the processing of each unit executed by a program. This program is stored in a storage device of the search system 1 and can be recorded on a recording medium such as a magnetic disk, an optical disk, or a semiconductor memory, or provided through a network.

The learning unit 10 takes a group of texts to be processed as input, learns a topic model and an embedding model simultaneously, and, given a document set, generates distributed vectors for the documents it contains and for their words. A topic model models the text generation process using the global, text-level co-occurrence of words based on topics. An embedding model models word co-occurrence using local information, namely the set of words appearing around each word. By coupling the topic model and the embedding model, a model that enjoys the advantages of both can be generated.

More specifically, the learning unit 10 includes a preprocessing unit 11 and a calculation processing unit 12, and generates a model in which a matrix representing the importance of words is expressed, with topics as the basis, by non-negative matrix factorization, and word co-occurrence is expressed as a matrix using skip-gram.

The preprocessing unit 11 divides the texts to be processed into words by morphological analysis, generates matrices D and S representing the importance of the words and parts of speech in each text based on word frequencies, and generates a word co-occurrence matrix W, a part-of-speech co-occurrence matrix E, and a word–part-of-speech co-occurrence matrix R based on the pointwise mutual information (PMI) between words, between parts of speech, and between words and parts of speech.

The calculation processing unit 12 factorizes the matrix D into a document–topic matrix Θ and a topic–word matrix T, and the matrix S into the document–topic matrix Θ and a topic–part-of-speech matrix G, thereby obtaining Θ, T, and G. It also obtains the word embedding matrix B and context-word embedding matrix C derived by factorizing the word co-occurrence matrix W, the part-of-speech embedding matrix Q and contextual part-of-speech embedding matrix H derived by factorizing the part-of-speech co-occurrence matrix E, and the part-of-speech embedding matrix Q and context-word embedding matrix C derived by factorizing the word–part-of-speech co-occurrence matrix R. Further, it obtains the topic embedding matrix A and context-word embedding matrix C derived by factorizing the topic–word matrix T, and the topic embedding matrix A and contextual part-of-speech embedding matrix H derived by factorizing the topic–part-of-speech matrix G. The details of the proposed model are described later.

Using a word or document received from the user terminal 5 as a key, the search unit 20 searches the calculation result storage unit 40 for distributed vectors close to the distributed vector obtained from the model, retrieves from the data storage unit 30 the words or documents having the distributed vectors found by the search, and returns them to the user terminal 5 as search results.

For example, when search content is entered in spoken language from the user terminal 5, the search unit 20 may first search for semantically close words using the spoken language as a key, and then input the words obtained by this search into the model as search keywords to obtain the search results.

The data storage unit 30 holds the documents and words to be searched. The documents held in the data storage unit 30 are also used, as the texts to be processed, for training the model of the learning unit 10.

The calculation result storage unit 40 holds the distributed vectors of documents and words. These distributed vectors are obtained by inputting the documents and words into the model generated by the learning unit 10. The calculation result storage unit 40 serves as an index of the semantic space of documents and words, in which closeness in distance corresponds to closeness in meaning.

[Proposed model]
Here, the model proposed in this embodiment will be described.

FIG. 2 is a diagram showing a matrix representation, using non-negative matrix factorization (NMF), of the graphical model of each text.

The left side of FIG. 2 shows the graphical model of each text. A graphical model consists of nodes and edges; a node represents a variable, and the direction of an edge indicates a causal relationship between nodes. The graphical model of FIG. 2 shows that, for token i of document d_j, the part of speech l_ji is probabilistically sampled from the topic z_ji, and the word w_ji is sampled from the topic z_ji and the part of speech l_ji.

The right side of FIG. 2 shows a matrix representation of the text set (document set) using NMF. The document × word matrix D is factorized into the document × topic matrix Θ and the topic × word matrix T. The document × part-of-speech matrix S is factorized into the document × topic matrix Θ and the topic × part-of-speech matrix G. The factorizations of D and S share the matrix Θ.
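As a hedged illustration of the shared-Θ factorization, the sketch below uses standard multiplicative NMF updates for the squared Frobenius loss; the patent's own update rules appear in an equation image that is not reproduced here, so the function name, the matrix orientation (documents × words), and the update scheme are all assumptions:

```python
import numpy as np

def coupled_nmf(D, S, K, iters=200, eps=1e-9):
    """Jointly factorize D ~ Theta @ T and S ~ Theta @ G with a shared Theta.

    Standard multiplicative updates for the squared Frobenius loss; assumes
    nonnegative inputs (e.g., tf*idf values). D: documents x words,
    S: documents x parts of speech, K: number of topics.
    """
    rng = np.random.default_rng(0)
    Theta = rng.random((D.shape[0], K))
    T = rng.random((K, D.shape[1]))
    G = rng.random((K, S.shape[1]))
    for _ in range(iters):
        T *= (Theta.T @ D) / (Theta.T @ Theta @ T + eps)
        G *= (Theta.T @ S) / (Theta.T @ Theta @ G + eps)
        # Theta is shared, so it accumulates evidence from both D and S.
        Theta *= (D @ T.T + S @ G.T) / (Theta @ (T @ T.T + G @ G.T) + eps)
    return Theta, T, G
```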

FIG. 3 shows a matrix representation, using skip-gram (window size = 2), of the nodes (variables) appearing in the graphical model of each text.

The left side of FIG. 3 shows the graphical model of each text, and the right side shows three skip-grams. The left skip-gram predicts the words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} within two positions before or after the word w_t at position t. The middle skip-gram predicts the parts of speech l_{t-2}, l_{t-1}, l_{t+1}, l_{t+2} within two positions before or after the part of speech l_t at position t. The right skip-gram predicts the words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} within two positions before or after the part of speech l_t at position t.

Training a skip-gram is equivalent to factorizing a PMI matrix. In training the model, the shifted positive PMI (SPPMI) of the co-occurrence matrices W, E, and R is used: the co-occurrence matrix W is factorized into the word embedding matrix B and the context-word embedding matrix C, the co-occurrence matrix E into the part-of-speech embedding matrix Q and the contextual part-of-speech embedding matrix H, and the co-occurrence matrix R into the part-of-speech embedding matrix Q and the context-word embedding matrix C.

In the present embodiment, SPPMI is further applied to the matrices T and G obtained in FIG. 2: the topic × word matrix T is factorized into the topic embedding matrix A and the context-word embedding matrix C, and the topic × part-of-speech matrix G into the topic embedding matrix A and the contextual part-of-speech embedding matrix H.

By obtaining the document–topic matrix Θ, the topic–word matrix T, the topic–part-of-speech matrix G, the topic embedding matrix A, the word embedding matrix B, the context-word embedding matrix C, the contextual part-of-speech embedding matrix H, and the part-of-speech embedding matrix Q, the learning unit 10 couples the topic model with skip-gram and learns the distributed vectors of the documents and words contained in the texts.

[Operation]
Next, the learning process will be described with reference to the flowchart of FIG. 4. The symbols used in the description and their meanings are listed in Table 1 below.

Table 1 (reconstructed from the description; the original table image is not reproduced):
- D: document × word matrix, elements d_ji are tf*idf values; V = number of distinct words, N = number of documents
- S: document × part-of-speech matrix, elements s_ji are tf*idf values; O = number of distinct parts of speech
- W, E, R: SPPMI co-occurrence matrices (word–word, part-of-speech–part-of-speech, word–part-of-speech)
- Θ: document–topic matrix; T: topic–word matrix; G: topic–part-of-speech matrix
- A: topic embedding matrix; B: word embedding matrix; C: context-word embedding matrix; Q: part-of-speech embedding matrix; H: contextual part-of-speech embedding matrix
- λ_d, λ_s, ...: regularization parameters

In step S11, the learning unit 10 reads the texts to be processed from the data storage unit 30 and divides them into words by morphological analysis. The learning unit 10 also obtains the part of speech of each word.

In step S12, the learning unit 10 creates the document × word matrix D and the document × part-of-speech matrix S. The matrix D is a V (number of distinct words) × N (number of documents) matrix whose element d_ji is the tf*idf value of word j in document i. The matrix S is an O (number of distinct parts of speech) × N (number of documents) matrix whose element s_ji is the tf*idf value of part of speech j in document i.
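A minimal sketch of building D (with rows as documents rather than the V × N orientation above, purely for convenience); S is built the same way over part-of-speech tags. The tokenization and the exact tf*idf weighting are assumptions, not taken from the patent:

```python
import numpy as np
from collections import Counter

def tfidf_matrix(docs):
    """docs: list of token lists from morphological analysis.

    Returns (D, vocab), where D[i, j] is the tf*idf value of word j in
    document i (rows are documents here, i.e., the transpose of V x N).
    """
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    D = np.zeros((n_docs, len(vocab)))
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        for w, count in tf.items():
            D[i, index[w]] = (count / len(doc)) * np.log(n_docs / df[w])
    return D, vocab
```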

In step S13, the learning unit 10 creates the word co-occurrence matrix W, the part-of-speech co-occurrence matrix E, and the word–part-of-speech co-occurrence matrix R.

The (i, j) element of the word co-occurrence matrix W is obtained from SPPMI as in the following equation.

$$W_{ij} = \max\bigl(\mathrm{PMI}(w_i, w_j) - \log k,\ 0\bigr), \qquad \mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\,p(w_j)}$$

(Reconstructed as the standard shifted positive PMI, where k is the shift constant corresponding to the number of negative samples; the original equation image is not reproduced.)

The co-occurrence matrices E and R can be created in the same manner as W.
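A minimal sketch of building W from token windows according to the SPPMI equation above; the window size and shift constant k are assumptions, and E and R follow the same pattern over part-of-speech sequences:

```python
import numpy as np
from collections import Counter

def sppmi_matrix(docs, vocab, window=2, k=1.0):
    """W[i, j] = max(PMI(w_i, w_j) - log k, 0), counted over +/- `window` tokens."""
    index = {w: i for i, w in enumerate(vocab)}
    pair, center, context = Counter(), Counter(), Counter()
    total = 0
    for doc in docs:
        for t, w in enumerate(doc):
            for c in doc[max(0, t - window):t] + doc[t + 1:t + 1 + window]:
                pair[(index[w], index[c])] += 1
                center[index[w]] += 1
                context[index[c]] += 1
                total += 1
    W = np.zeros((len(vocab), len(vocab)))
    for (i, j), n in pair.items():
        pmi = np.log(n * total / (center[i] * context[j]))
        W[i, j] = max(pmi - np.log(k), 0.0)
    return W
```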

In step S14, the learning unit 10 updates the parameters of the model. The derivation of the parameters is described below.

As explained with reference to FIG. 2, in the present embodiment the matrices D and S are interpreted as a topic model using NMF, with D = Θ × T and S = Θ × G. The matrices T, G, and Θ can be derived using the objective functions L_D and L_S of the following equations; D and S are factorized by finding the T, G, and Θ that minimize L_D and L_S.

$$L_D = \lVert D - \Theta T \rVert_F^2 + \lambda_d \lVert T \rVert_F^2, \qquad L_S = \lVert S - \Theta G \rVert_F^2 + \lambda_s \lVert G \rVert_F^2$$

(Reconstructed in a standard regularized-NMF form; the original equation images are not reproduced.)

The second term on the right-hand side is a regularization term for preventing overfitting, and λ_d and λ_s are parameters that adjust the regularization term. The same applies to the equations below.

As explained with reference to FIG. 3, training the skip-grams corresponds to factorizing the co-occurrence matrices W, E, and R created in step S13, and the matrices B, C, Q, and H can be derived using the objective functions L_W, L_E, and L_R of the following equations.

$$L_W = \lVert W - BC \rVert_F^2 + \lambda_w\bigl(\lVert B \rVert_F^2 + \lVert C \rVert_F^2\bigr)$$
$$L_E = \lVert E - QH \rVert_F^2 + \lambda_e\bigl(\lVert Q \rVert_F^2 + \lVert H \rVert_F^2\bigr)$$
$$L_R = \lVert R - QC \rVert_F^2 + \lambda_r\bigl(\lVert Q \rVert_F^2 + \lVert C \rVert_F^2\bigr)$$

(Reconstructed in a standard regularized matrix-factorization form; the original equation images are not reproduced.)

Further, factorizing the matrix T as T = A × C and the matrix G as G = A × H, the matrices A, C, and H can be derived using the objective function L_{T,G} of the following equation.

$$L_{T,G} = \lVert T - AC \rVert_F^2 + \lVert G - AH \rVert_F^2 + \lambda_t\bigl(\lVert A \rVert_F^2 + \lVert C \rVert_F^2 + \lVert H \rVert_F^2\bigr)$$

(Reconstructed; the original equation image is not reproduced.)

Putting the above objective functions together, the objective function of the model is as follows.

$$L = L_D + L_S + L_W + L_E + L_R + L_{T,G}$$

(Reconstructed as the sum of the above objective functions; the original equation image is not reproduced.)

Expanding the objective function of the model yields the following equation.

(Equation image not reproduced: the expanded form of the model objective function L.)

The parameters updated by the learning unit 10 follow the update equations, shown below, derived from the model objective function.

(Equation image not reproduced: the parameter update equations derived from the model objective function.)
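Since the concrete update equations are in an unreproduced image, the following is only an assumed illustration: a plain gradient-descent step on the skip-gram term L_W, with the updates for the remaining matrices (Θ, T, G, A, Q, H) following the same pattern from their respective terms:

```python
import numpy as np

def update_w_term(W, B, C, lam=0.1, lr=1e-3):
    """One gradient step on L_W = ||W - B @ C||_F^2 + lam * (||B||_F^2 + ||C||_F^2).

    B: word embeddings (V x dim); C: context-word embeddings (dim x V).
    """
    resid = W - B @ C
    grad_B = -2.0 * resid @ C.T + 2.0 * lam * B
    grad_C = -2.0 * B.T @ resid + 2.0 * lam * C
    return B - lr * grad_B, C - lr * grad_C
```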

Next, the search process will be described with reference to the flowchart of FIG. 5.

In step S21, the search unit 20 receives search content (a query) from the user terminal 5. The query may be a word or a document.

In step S22, the search unit 20 queries the calculation result storage unit 40 using the received query as a key and obtains the corresponding distributed vector.

In step S23, the search unit 20 searches the calculation result storage unit 40 for distributed vectors close to the distributed vector obtained in step S22, and retrieves from the data storage unit 30 the documents or words corresponding to the close distributed vectors found. The distance between distributed vectors can be measured using cosine similarity.
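A minimal sketch of the nearest-vector search in step S23, assuming (hypothetically) that the calculation result storage unit exposes its vectors as one NumPy matrix with an aligned list of item identifiers:

```python
import numpy as np

def top_k_cosine(query_vec, vectors, ids, k=5):
    """Return the k stored items whose vectors are most cosine-similar to query_vec.

    vectors: (num_items x dim) matrix; ids: identifiers aligned with its rows.
    """
    sims = vectors @ query_vec / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_vec) + 1e-12)
    best = np.argsort(-sims)[:k]
    return [(ids[i], float(sims[i])) for i in best]
```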

In step S24, the search unit 20 returns the words or documents obtained in step S23 to the user terminal 5 as search results.

[Verification]
To evaluate the effectiveness of the search system of the present embodiment in improving the precision of a QA system, the search functions of the current help desk system and of the search system of the present embodiment were verified.

About 700 QA entries were prepared as search data, and about 150 questions with their correct answers were prepared as search content. The search data was registered in the current system by batch registration, the questions were searched one by one, and the top five outputs were obtained. The search system of the present embodiment was trained on the search data, the questions were searched one by one, and the top five outputs were likewise obtained. Five, four, three, two, or one point was awarded according to the position of the correct answer among the first to fifth outputs; for example, when the first output was the correct answer to the question, five points were awarded. When the outputs did not contain the correct answer, zero points were awarded. The verification results are shown in Table 2 below.
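The scoring rule can be stated compactly in code; a sketch with hypothetical argument names, not taken from the patent:

```python
def score(outputs, correct):
    """5..1 points if the correct answer is ranked 1st..5th; 0 if it is absent."""
    for rank, answer in enumerate(outputs[:5], start=1):
        if answer == correct:
            return 6 - rank
    return 0
```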

(Table 2 image not reproduced: total scores and the number of zero-point questions for the current system and for the search system of the present embodiment.)

The search system of the present embodiment achieves a higher total score for correct answers than the current system, that is, higher precision. It also yields fewer zero-point questions than the current system, that is, higher recall. Precision is a measure of accuracy, expressing the degree of agreement with the correct-answer data; recall is a measure of completeness, expressing the degree to which the correct-answer data are covered.

As described above, the search system 1 of the present embodiment includes the learning unit 10 and the search unit 20. The learning unit 10 takes the texts to be processed as input and simultaneously learns a topic model, obtained by factorizing by NMF, with topics as the basis, the matrices D and S representing the importance of the words and parts of speech of each text, and a skip-gram model based on the co-occurrence of words in the texts, thereby generating a model that outputs a distributed vector when a document or word is input. The search unit 20 inputs a word or document into the model and outputs, as search results, words or documents having distributed vectors close to the obtained distributed vector. Because the search system 1 learns the distributed vectors of documents and words simultaneously, documents and words can be expressed in the same semantic space. As a result, not only exact keyword matches but also items close in meaning can be retrieved, and an improvement in search efficiency can be expected. Moreover, when an appropriate keyword does not come to mind, appropriate search results can still be obtained by entering a document (spoken language) that expresses it.

Because the search system 1 of the present embodiment generates the matrices representing the importance of the words and parts of speech of each text using their tf*idf values, the weights of the words that make up a document are taken into account, and an improvement in the accuracy of document retrieval can be expected.

Because the search system 1 of the present embodiment learns the model by combining topics and parts of speech, it can suppress antonyms acquiring similar distributed vectors.

1 ... Search system
10 ... Learning unit
11 ... Preprocessing unit
12 ... Calculation processing unit
20 ... Search unit
30 ... Data storage unit
40 ... Calculation result storage unit

Claims (7)

A search device comprising:
a learning unit that takes a group of texts to be processed as input and simultaneously learns a first model, in which a matrix representing the importance of the words in each text of the text group is factorized by non-negative matrix factorization with topics as the basis, and a second model based on the co-occurrence of the words in the text group, thereby generating a model that outputs a distributed vector when a document or word is input; and
a search unit that inputs a word or document into the model and outputs, as a search result, a word or document having a distributed vector close to the obtained distributed vector.
The search device according to claim 1, wherein
the first model is a model in which a matrix representing the importance of the words in each text is factorized into a document–topic matrix and a topic–word matrix, and a matrix representing the importance of the parts of speech in each text is factorized into the document–topic matrix and a topic–part-of-speech matrix,
the second model is a model in which the co-occurrence of the words in the text group is expressed by a word co-occurrence matrix, a part-of-speech co-occurrence matrix, and a word–part-of-speech co-occurrence matrix, and
the learning unit takes the group of texts to be processed as input and generates the model by obtaining the document–topic matrix, the topic–word matrix, the topic–part-of-speech matrix, the word embedding matrix and context-word embedding matrix derived by factorizing the word co-occurrence matrix, the part-of-speech embedding matrix and contextual part-of-speech embedding matrix derived by factorizing the part-of-speech co-occurrence matrix, the part-of-speech embedding matrix and context-word embedding matrix derived by factorizing the word–part-of-speech co-occurrence matrix, the topic embedding matrix and context-word embedding matrix derived by factorizing the topic–word matrix, and the topic embedding matrix and contextual part-of-speech embedding matrix derived by factorizing the topic–part-of-speech matrix.
The search device according to claim 2, wherein
the learning unit generates the matrix representing the importance of the words in each text based on word frequencies and generates the matrix representing the importance of the parts of speech in each text based on part-of-speech frequencies, and
the learning unit generates the word co-occurrence matrix based on the pointwise mutual information between words, generates the part-of-speech co-occurrence matrix based on the pointwise mutual information between parts of speech, and generates the word–part-of-speech co-occurrence matrix based on the pointwise mutual information between words and parts of speech.
A search method in which a computer:
inputs a group of texts to be processed;
simultaneously learns a first model, in which a matrix representing the importance of the words in each text of the text group is factorized by non-negative matrix factorization with topics as the basis, and a second model based on the co-occurrence of the words in the text group, thereby generating a model that outputs a distributed vector when a document or word is input; and
inputs a word or document into the model and outputs, as a search result, a word or document having a distributed vector close to the obtained distributed vector.
The search method according to claim 4, wherein
the first model is a model in which a matrix representing the importance of the words in each text is factorized into a document–topic matrix and a topic–word matrix, and a matrix representing the importance of the parts of speech in each text is factorized into the document–topic matrix and a topic–part-of-speech matrix,
the second model is a model in which the co-occurrence of the words in the text group is expressed by a word co-occurrence matrix, a part-of-speech co-occurrence matrix, and a word–part-of-speech co-occurrence matrix, and
the computer takes the group of texts to be processed as input and generates the model by obtaining the document–topic matrix, the topic–word matrix, the topic–part-of-speech matrix, the word embedding matrix and context-word embedding matrix derived by factorizing the word co-occurrence matrix, the part-of-speech embedding matrix and contextual part-of-speech embedding matrix derived by factorizing the part-of-speech co-occurrence matrix, the part-of-speech embedding matrix and context-word embedding matrix derived by factorizing the word–part-of-speech co-occurrence matrix, the topic embedding matrix and context-word embedding matrix derived by factorizing the topic–word matrix, and the topic embedding matrix and contextual part-of-speech embedding matrix derived by factorizing the topic–part-of-speech matrix.
The search method according to claim 5, wherein the computer:
generates the matrix representing the importance of the words in each text based on word frequencies;
generates the matrix representing the importance of the parts of speech in each text based on part-of-speech frequencies;
generates the word co-occurrence matrix based on the pointwise mutual information between words;
generates the part-of-speech co-occurrence matrix based on the pointwise mutual information between parts of speech; and
generates the word–part-of-speech co-occurrence matrix based on the pointwise mutual information between words and parts of speech.
A program that causes a computer to operate as each unit of the search device according to any one of claims 1 to 3.
JP2020205794A 2020-12-11 2020-12-11 Search device, search method, and program Active JP7012811B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2020205794A JP7012811B1 (en) 2020-12-11 2020-12-11 Search device, search method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2020205794A JP7012811B1 (en) 2020-12-11 2020-12-11 Search device, search method, and program

Publications (2)

Publication Number Publication Date
JP7012811B1 true JP7012811B1 (en) 2022-01-28
JP2022092849A JP2022092849A (en) 2022-06-23

Family

ID=80735325

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2020205794A Active JP7012811B1 (en) 2020-12-11 2020-12-11 Search device, search method, and program

Country Status (1)

Country Link
JP (1) JP7012811B1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015152983A (en) 2014-02-12 2015-08-24 日本電信電話株式会社 Topic modeling device, topic modeling method, and topic modeling program

Also Published As

Publication number Publication date
JP2022092849A (en) 2022-06-23

Similar Documents

Publication Publication Date Title
CN109885672B (en) Question-answering type intelligent retrieval system and method for online education
US11250033B2 (en) Methods, systems, and computer program product for implementing real-time classification and recommendations
US11086601B2 (en) Methods, systems, and computer program product for automatic generation of software application code
US10705796B1 (en) Methods, systems, and computer program product for implementing real-time or near real-time classification of digital data
JP6309644B2 (en) Method, system, and storage medium for realizing smart question answer
US10467122B1 (en) Methods, systems, and computer program product for capturing and classification of real-time data and performing post-classification tasks
US7295965B2 (en) Method and apparatus for determining a measure of similarity between natural language sentences
CN114565104A (en) Language model pre-training method, result recommendation method and related device
CN112685550B (en) Intelligent question-answering method, intelligent question-answering device, intelligent question-answering server and computer readable storage medium
CN114896377A (en) Knowledge graph-based answer acquisition method
CN113821527A (en) Hash code generation method and device, computer equipment and storage medium
CN115374270A (en) Legal text abstract generation method based on graph neural network
US20220245361A1 (en) System and method for managing and optimizing lookup source templates in a natural language understanding (nlu) framework
CN117370580A (en) Knowledge-graph-based large language model enhanced dual-carbon field service method
JP7012811B1 (en) Search device, search method, and program
CN116450855A (en) Knowledge graph-based reply generation strategy method and system for question-answering robot
Bachrach et al. An attention mechanism for neural answer selection using a combined global and local view
US20220237383A1 (en) Concept system for a natural language understanding (nlu) framework
US20220229990A1 (en) System and method for lookup source segmentation scoring in a natural language understanding (nlu) framework
Karpagam et al. Deep learning approaches for answer selection in question answering system for conversation agents
Lokman et al. A conceptual IR chatbot framework with automated keywords-based vector representation generation
CN114328820A (en) Information searching method and related equipment
CN113157892A (en) User intention processing method and device, computer equipment and storage medium
Tamang et al. Adding smarter systems instead of human annotators: re-ranking for system combination
CN114722267A (en) Information pushing method and device and server

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20201211

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20220111

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20220118

R150 Certificate of patent or registration of utility model

Ref document number: 7012811

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250