JP7012811B1 - Search device, search method, and program - Google Patents

Search device, search method, and program

Info

Publication number
JP7012811B1
Authority
JP
Japan
Prior art keywords
matrix
word
words
topic
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
JP2020205794A
Other languages
Japanese (ja)
Other versions
JP2022092849A (en)
Inventor
徳章 川前
Original Assignee
エヌ・ティ・ティ・コムウェア株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by エヌ・ティ・ティ・コムウェア株式会社
Priority to JP2020205794A
Application granted granted Critical
Publication of JP7012811B1
Publication of JP2022092849A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PROBLEM TO BE SOLVED: To improve the efficiency of searching for answers to questions.
SOLUTION: A search system 1 includes a learning unit 10 and a search unit 20. The learning unit 10 takes a group of texts to be processed as input and simultaneously learns a topic model, obtained by factorizing by NMF, with topics as the basis, the matrices D and S representing the importance of the words and parts of speech of each text, and a skip-gram model based on the co-occurrence of words in the texts, thereby generating a model that outputs a distributed vector when a document or word is input. The search unit 20 inputs a word or document into the model and outputs, as search results, words or documents having distributed vectors close to the obtained distributed vector.
[Selected drawing] FIG. 1

Description

The present invention relates to a search device, a search method, and a program.

Existing search systems are an extension of database search and retrieve items that match keywords. For example, in a search method that uses distributed embedding representations (distributed vectors), each word or document is represented by a distributed vector, and the search is executed using vectors learned so that closeness between vectors corresponds to closeness in meaning. This makes it possible to search for words or documents that are semantically close to a given word or document by vector operations.
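As a minimal illustration of this kind of search (not code from the patent), the following sketch assumes pre-learned embeddings held as a dict mapping words to NumPy arrays:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Closeness of vectors stands in for closeness of meaning.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_words(query_vec: np.ndarray, embeddings: dict, n: int = 5):
    # Rank all vocabulary words by cosine similarity to the query vector.
    scored = [(w, cosine(query_vec, vec)) for w, vec in embeddings.items()]
    return sorted(scored, key=lambda x: -x[1])[:n]
```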

International Publication No. WO 2018/151125

However, because existing distributed vectors are learned using only the information from a word's surrounding context, even antonyms may end up with similar distributed vectors. In addition, although a document is composed of words and not all words carry the same weight, the document's distributed vector is not learned with these weights taken into account, so the accuracy of document retrieval suffers.

The present invention has been made in view of the above, and an object of the present invention is to improve the efficiency of searching for answers to questions.

A search device according to one aspect of the present invention includes: a learning unit that takes a group of texts to be processed as input and simultaneously learns a first model, in which a matrix representing the importance of the words in each text of the text group is factorized by non-negative matrix factorization with topics as the basis, and a second model based on the co-occurrence of the words in the text group, thereby generating a model that outputs a distributed vector when a document or word is input; and a search unit that inputs a word or document into the model and outputs, as search results, words or documents having distributed vectors close to the obtained distributed vector.

According to the present invention, it is possible to improve the efficiency of searching for answers to questions.

FIG. 1 is a diagram showing an example of the configuration of the search system of the present embodiment.
FIG. 2 is a diagram expressing the model of the present embodiment by non-negative matrix factorization.
FIG. 3 is a diagram expressing the model of the present embodiment by skip-gram.
FIG. 4 is a flowchart showing an example of the flow of the learning process of the search system of the present embodiment.
FIG. 5 is a flowchart showing an example of the flow of the search process of the search system of the present embodiment.

[System configuration]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 is a diagram showing an example of the configuration of the search system of the present embodiment. The search system 1 shown in the figure includes a learning unit 10, a search unit 20, a data storage unit 30, and a calculation result storage unit 40. Each unit of the search system 1 may be implemented on a computer equipped with an arithmetic processing unit, a storage device, and the like, with the processing of each unit executed by a program. This program is stored in a storage device of the search system 1 and can be recorded on a recording medium such as a magnetic disk, an optical disk, or a semiconductor memory, or provided through a network.

The learning unit 10 takes a group of texts to be processed as input, learns a topic model and an embedding model simultaneously, and, given a document set, generates distributed vectors for the documents it contains and for their words. A topic model models the text generation process using the global, text-level co-occurrence of words based on topics. An embedding model models word co-occurrence using local information, namely the set of words appearing around each word. By coupling the topic model and the embedding model, a model that enjoys the advantages of both can be generated.

More specifically, the learning unit 10 includes a preprocessing unit 11 and a calculation processing unit 12, and generates a model in which a matrix representing the importance of words is expressed, with topics as the basis, by non-negative matrix factorization, and word co-occurrence is expressed as a matrix using skip-gram.

The preprocessing unit 11 divides the texts to be processed into words by morphological analysis, generates matrices D and S representing the importance of the words and parts of speech in each text based on word frequencies, and generates a word co-occurrence matrix W, a part-of-speech co-occurrence matrix E, and a word–part-of-speech co-occurrence matrix R based on the pointwise mutual information (PMI) between words, between parts of speech, and between words and parts of speech.

The calculation processing unit 12 factorizes the matrix D into a document–topic matrix Θ and a topic–word matrix T, and the matrix S into the document–topic matrix Θ and a topic–part-of-speech matrix G, thereby obtaining Θ, T, and G. It also obtains the word embedding matrix B and context-word embedding matrix C derived by factorizing the word co-occurrence matrix W, the part-of-speech embedding matrix Q and contextual part-of-speech embedding matrix H derived by factorizing the part-of-speech co-occurrence matrix E, and the part-of-speech embedding matrix Q and context-word embedding matrix C derived by factorizing the word–part-of-speech co-occurrence matrix R. Further, it obtains the topic embedding matrix A and context-word embedding matrix C derived by factorizing the topic–word matrix T, and the topic embedding matrix A and contextual part-of-speech embedding matrix H derived by factorizing the topic–part-of-speech matrix G. The details of the proposed model are described later.

Using a word or document received from the user terminal 5 as a key, the search unit 20 searches the calculation result storage unit 40 for distributed vectors close to the distributed vector obtained from the model, retrieves from the data storage unit 30 the words or documents having the distributed vectors found by the search, and returns them to the user terminal 5 as search results.

For example, when search content is entered in spoken language from the user terminal 5, the search unit 20 may first search for semantically close words using the spoken language as a key, and then input the words obtained by this search into the model as search keywords to obtain the search results.

The data storage unit 30 holds the documents and words to be searched. The documents held in the data storage unit 30 are also used, as the texts to be processed, for training the model of the learning unit 10.

The calculation result storage unit 40 holds the distributed vectors of documents and words. These distributed vectors are obtained by inputting the documents and words into the model generated by the learning unit 10. The calculation result storage unit 40 serves as an index of the semantic space of documents and words, in which closeness in distance corresponds to closeness in meaning.

[Proposed model]
Here, the model proposed in this embodiment will be described.

FIG. 2 is a diagram showing a matrix representation, using non-negative matrix factorization (NMF), of the graphical model of each text.

The left side of FIG. 2 shows the graphical model of each text. A graphical model consists of nodes and edges; a node represents a variable, and the direction of an edge indicates a causal relationship between nodes. The graphical model of FIG. 2 shows that, for token i of document d_j, the part of speech l_ji is probabilistically sampled from the topic z_ji, and the word w_ji is sampled from the topic z_ji and the part of speech l_ji.

The right side of FIG. 2 shows a matrix representation of the text set (document set) using NMF. The document × word matrix D is factorized into the document × topic matrix Θ and the topic × word matrix T. The document × part-of-speech matrix S is factorized into the document × topic matrix Θ and the topic × part-of-speech matrix G. The factorizations of D and S share the matrix Θ.
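As a hedged illustration of the shared-Θ factorization, the sketch below uses standard multiplicative NMF updates for the squared Frobenius loss; the patent's own update rules appear in an equation image that is not reproduced here, so the function name, the matrix orientation (documents × words), and the update scheme are all assumptions:

```python
import numpy as np

def coupled_nmf(D, S, K, iters=200, eps=1e-9):
    """Jointly factorize D ~ Theta @ T and S ~ Theta @ G with a shared Theta.

    Standard multiplicative updates for the squared Frobenius loss; assumes
    nonnegative inputs (e.g., tf*idf values). D: documents x words,
    S: documents x parts of speech, K: number of topics.
    """
    rng = np.random.default_rng(0)
    Theta = rng.random((D.shape[0], K))
    T = rng.random((K, D.shape[1]))
    G = rng.random((K, S.shape[1]))
    for _ in range(iters):
        T *= (Theta.T @ D) / (Theta.T @ Theta @ T + eps)
        G *= (Theta.T @ S) / (Theta.T @ Theta @ G + eps)
        # Theta is shared, so it accumulates evidence from both D and S.
        Theta *= (D @ T.T + S @ G.T) / (Theta @ (T @ T.T + G @ G.T) + eps)
    return Theta, T, G
```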

FIG. 3 shows a matrix representation, using skip-gram (window size = 2), of the nodes (variables) appearing in the graphical model of each text.

The left side of FIG. 3 shows the graphical model of each text, and the right side shows three skip-grams. The left skip-gram predicts the words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} within two positions before or after the word w_t at position t. The middle skip-gram predicts the parts of speech l_{t-2}, l_{t-1}, l_{t+1}, l_{t+2} within two positions before or after the part of speech l_t at position t. The right skip-gram predicts the words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} within two positions before or after the part of speech l_t at position t.

Training a skip-gram is equivalent to factorizing a PMI matrix. In training the model, the shifted positive PMI (SPPMI) of the co-occurrence matrices W, E, and R is used: the co-occurrence matrix W is factorized into the word embedding matrix B and the context-word embedding matrix C, the co-occurrence matrix E into the part-of-speech embedding matrix Q and the contextual part-of-speech embedding matrix H, and the co-occurrence matrix R into the part-of-speech embedding matrix Q and the context-word embedding matrix C.

In the present embodiment, SPPMI is further applied to the matrices T and G obtained in FIG. 2: the topic × word matrix T is factorized into the topic embedding matrix A and the context-word embedding matrix C, and the topic × part-of-speech matrix G into the topic embedding matrix A and the contextual part-of-speech embedding matrix H.

By obtaining the document–topic matrix Θ, the topic–word matrix T, the topic–part-of-speech matrix G, the topic embedding matrix A, the word embedding matrix B, the context-word embedding matrix C, the contextual part-of-speech embedding matrix H, and the part-of-speech embedding matrix Q, the learning unit 10 couples the topic model with skip-gram and learns the distributed vectors of the documents and words contained in the texts.

[Operation]
Next, the learning process will be described with reference to the flowchart of FIG. 4. The symbols used in the description and their meanings are listed in Table 1 below.

Table 1 (reconstructed from the description; the original table image is not reproduced):
- D: document × word matrix, elements d_ji are tf*idf values; V = number of distinct words, N = number of documents
- S: document × part-of-speech matrix, elements s_ji are tf*idf values; O = number of distinct parts of speech
- W, E, R: SPPMI co-occurrence matrices (word–word, part-of-speech–part-of-speech, word–part-of-speech)
- Θ: document–topic matrix; T: topic–word matrix; G: topic–part-of-speech matrix
- A: topic embedding matrix; B: word embedding matrix; C: context-word embedding matrix; Q: part-of-speech embedding matrix; H: contextual part-of-speech embedding matrix
- λ_d, λ_s, ...: regularization parameters

In step S11, the learning unit 10 reads the texts to be processed from the data storage unit 30 and divides them into words by morphological analysis. The learning unit 10 also obtains the part of speech of each word.

In step S12, the learning unit 10 creates the document × word matrix D and the document × part-of-speech matrix S. The matrix D is a V (number of distinct words) × N (number of documents) matrix whose element d_ji is the tf*idf value of word j in document i. The matrix S is an O (number of distinct parts of speech) × N (number of documents) matrix whose element s_ji is the tf*idf value of part of speech j in document i.
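A minimal sketch of building D (with rows as documents rather than the V × N orientation above, purely for convenience); S is built the same way over part-of-speech tags. The tokenization and the exact tf*idf weighting are assumptions, not taken from the patent:

```python
import numpy as np
from collections import Counter

def tfidf_matrix(docs):
    """docs: list of token lists from morphological analysis.

    Returns (D, vocab), where D[i, j] is the tf*idf value of word j in
    document i (rows are documents here, i.e., the transpose of V x N).
    """
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    D = np.zeros((n_docs, len(vocab)))
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        for w, count in tf.items():
            D[i, index[w]] = (count / len(doc)) * np.log(n_docs / df[w])
    return D, vocab
```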

In step S13, the learning unit 10 creates the word co-occurrence matrix W, the part-of-speech co-occurrence matrix E, and the word–part-of-speech co-occurrence matrix R.

The (i, j) element of the word co-occurrence matrix W is obtained from SPPMI as in the following equation.

$$W_{ij} = \max\bigl(\mathrm{PMI}(w_i, w_j) - \log k,\ 0\bigr), \qquad \mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\,p(w_j)}$$

(Reconstructed as the standard shifted positive PMI, where k is the shift constant corresponding to the number of negative samples; the original equation image is not reproduced.)

The co-occurrence matrices E and R can be created in the same manner as W.
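A minimal sketch of building W from token windows according to the SPPMI equation above; the window size and shift constant k are assumptions, and E and R follow the same pattern over part-of-speech sequences:

```python
import numpy as np
from collections import Counter

def sppmi_matrix(docs, vocab, window=2, k=1.0):
    """W[i, j] = max(PMI(w_i, w_j) - log k, 0), counted over +/- `window` tokens."""
    index = {w: i for i, w in enumerate(vocab)}
    pair, center, context = Counter(), Counter(), Counter()
    total = 0
    for doc in docs:
        for t, w in enumerate(doc):
            for c in doc[max(0, t - window):t] + doc[t + 1:t + 1 + window]:
                pair[(index[w], index[c])] += 1
                center[index[w]] += 1
                context[index[c]] += 1
                total += 1
    W = np.zeros((len(vocab), len(vocab)))
    for (i, j), n in pair.items():
        pmi = np.log(n * total / (center[i] * context[j]))
        W[i, j] = max(pmi - np.log(k), 0.0)
    return W
```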

In step S14, the learning unit 10 updates the parameters of the model. The derivation of the parameters is described below.

As explained with reference to FIG. 2, in the present embodiment the matrices D and S are interpreted as a topic model using NMF, with D = Θ × T and S = Θ × G. The matrices T, G, and Θ can be derived using the objective functions L_D and L_S of the following equations; D and S are factorized by finding the T, G, and Θ that minimize L_D and L_S.

$$L_D = \lVert D - \Theta T \rVert_F^2 + \lambda_d \lVert T \rVert_F^2, \qquad L_S = \lVert S - \Theta G \rVert_F^2 + \lambda_s \lVert G \rVert_F^2$$

(Reconstructed in a standard regularized-NMF form; the original equation images are not reproduced.)

The second term on the right-hand side is a regularization term for preventing overfitting, and λ_d and λ_s are parameters that adjust the regularization term. The same applies to the equations below.

As explained with reference to FIG. 3, training the skip-grams corresponds to factorizing the co-occurrence matrices W, E, and R created in step S13, and the matrices B, C, Q, and H can be derived using the objective functions L_W, L_E, and L_R of the following equations.

$$L_W = \lVert W - BC \rVert_F^2 + \lambda_w\bigl(\lVert B \rVert_F^2 + \lVert C \rVert_F^2\bigr)$$
$$L_E = \lVert E - QH \rVert_F^2 + \lambda_e\bigl(\lVert Q \rVert_F^2 + \lVert H \rVert_F^2\bigr)$$
$$L_R = \lVert R - QC \rVert_F^2 + \lambda_r\bigl(\lVert Q \rVert_F^2 + \lVert C \rVert_F^2\bigr)$$

(Reconstructed in a standard regularized matrix-factorization form; the original equation images are not reproduced.)

Further, factorizing the matrix T as T = A × C and the matrix G as G = A × H, the matrices A, C, and H can be derived using the objective function L_{T,G} of the following equation.

$$L_{T,G} = \lVert T - AC \rVert_F^2 + \lVert G - AH \rVert_F^2 + \lambda_t\bigl(\lVert A \rVert_F^2 + \lVert C \rVert_F^2 + \lVert H \rVert_F^2\bigr)$$

(Reconstructed; the original equation image is not reproduced.)

Putting the above objective functions together, the objective function of the model is as follows.

$$L = L_D + L_S + L_W + L_E + L_R + L_{T,G}$$

(Reconstructed as the sum of the above objective functions; the original equation image is not reproduced.)

Expanding the objective function of the model yields the following equation.

(Equation image not reproduced: the expanded form of the model objective function L.)

The parameters updated by the learning unit 10 follow the update equations, shown below, derived from the model objective function.

(Equation image not reproduced: the parameter update equations derived from the model objective function.)
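Since the concrete update equations are in an unreproduced image, the following is only an assumed illustration: a plain gradient-descent step on the skip-gram term L_W, with the updates for the remaining matrices (Θ, T, G, A, Q, H) following the same pattern from their respective terms:

```python
import numpy as np

def update_w_term(W, B, C, lam=0.1, lr=1e-3):
    """One gradient step on L_W = ||W - B @ C||_F^2 + lam * (||B||_F^2 + ||C||_F^2).

    B: word embeddings (V x dim); C: context-word embeddings (dim x V).
    """
    resid = W - B @ C
    grad_B = -2.0 * resid @ C.T + 2.0 * lam * B
    grad_C = -2.0 * B.T @ resid + 2.0 * lam * C
    return B - lr * grad_B, C - lr * grad_C
```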

Next, the search process will be described with reference to the flowchart of FIG. 5.

In step S21, the search unit 20 receives search content (a query) from the user terminal 5. The query may be a word or a document.

In step S22, the search unit 20 queries the calculation result storage unit 40 using the received query as a key and obtains the corresponding distributed vector.

In step S23, the search unit 20 searches the calculation result storage unit 40 for distributed vectors close to the distributed vector obtained in step S22, and retrieves from the data storage unit 30 the documents or words corresponding to the close distributed vectors found. The distance between distributed vectors can be measured using cosine similarity.
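A minimal sketch of the nearest-vector search in step S23, assuming (hypothetically) that the calculation result storage unit exposes its vectors as one NumPy matrix with an aligned list of item identifiers:

```python
import numpy as np

def top_k_cosine(query_vec, vectors, ids, k=5):
    """Return the k stored items whose vectors are most cosine-similar to query_vec.

    vectors: (num_items x dim) matrix; ids: identifiers aligned with its rows.
    """
    sims = vectors @ query_vec / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_vec) + 1e-12)
    best = np.argsort(-sims)[:k]
    return [(ids[i], float(sims[i])) for i in best]
```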

In step S24, the search unit 20 returns the words or documents obtained in step S23 to the user terminal 5 as search results.

[Verification]
To evaluate the effectiveness of the search system of the present embodiment in improving the precision of a QA system, the search functions of the current help desk system and of the search system of the present embodiment were verified.

About 700 QA entries were prepared as search data, and about 150 questions with their correct answers were prepared as search content. The search data was registered in the current system by batch registration, the questions were searched one by one, and the top five outputs were obtained. The search system of the present embodiment was trained on the search data, the questions were searched one by one, and the top five outputs were likewise obtained. Five, four, three, two, or one point was awarded according to the position of the correct answer among the first to fifth outputs; for example, when the first output was the correct answer to the question, five points were awarded. When the outputs did not contain the correct answer, zero points were awarded. The verification results are shown in Table 2 below.
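The scoring rule can be stated compactly in code; a sketch with hypothetical argument names, not taken from the patent:

```python
def score(outputs, correct):
    """5..1 points if the correct answer is ranked 1st..5th; 0 if it is absent."""
    for rank, answer in enumerate(outputs[:5], start=1):
        if answer == correct:
            return 6 - rank
    return 0
```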

(Table 2 image not reproduced: total scores and the number of zero-point questions for the current system and for the search system of the present embodiment.)

The search system of the present embodiment achieves a higher total score for correct answers than the current system, that is, higher precision. It also yields fewer zero-point questions than the current system, that is, higher recall. Precision is a measure of accuracy, expressing the degree of agreement with the correct-answer data; recall is a measure of completeness, expressing the degree to which the correct-answer data are covered.

As described above, the search system 1 of the present embodiment includes the learning unit 10 and the search unit 20. The learning unit 10 takes the texts to be processed as input and simultaneously learns a topic model, obtained by factorizing by NMF, with topics as the basis, the matrices D and S representing the importance of the words and parts of speech of each text, and a skip-gram model based on the co-occurrence of words in the texts, thereby generating a model that outputs a distributed vector when a document or word is input. The search unit 20 inputs a word or document into the model and outputs, as search results, words or documents having distributed vectors close to the obtained distributed vector. Because the search system 1 learns the distributed vectors of documents and words simultaneously, documents and words can be expressed in the same semantic space. As a result, not only exact keyword matches but also items close in meaning can be retrieved, and an improvement in search efficiency can be expected. Moreover, when an appropriate keyword does not come to mind, appropriate search results can still be obtained by entering a document (spoken language) that expresses it.

Because the search system 1 of the present embodiment generates the matrices representing the importance of the words and parts of speech of each text using their tf*idf values, the weights of the words that make up a document are taken into account, and an improvement in the accuracy of document retrieval can be expected.

Because the search system 1 of the present embodiment learns the model by combining topics and parts of speech, it can suppress antonyms acquiring similar distributed vectors.

1 ... Search system
10 ... Learning unit
11 ... Preprocessing unit
12 ... Calculation processing unit
20 ... Search unit
30 ... Data storage unit
40 ... Calculation result storage unit

Claims (7)

A search device comprising:
a learning unit that takes a group of texts to be processed as input and simultaneously learns a first model, in which a matrix representing the importance of the words in each text of the text group is factorized by non-negative matrix factorization with topics as the basis, and a second model based on the co-occurrence of the words in the text group, thereby generating a model that outputs a distributed vector when a document or word is input; and
a search unit that inputs a word or document into the model and outputs, as a search result, a word or document having a distributed vector close to the obtained distributed vector.
The search device according to claim 1, wherein
the first model is a model in which a matrix representing the importance of the words in each text is factorized into a document–topic matrix and a topic–word matrix, and a matrix representing the importance of the parts of speech in each text is factorized into the document–topic matrix and a topic–part-of-speech matrix,
the second model is a model in which the co-occurrence of the words in the text group is expressed by a word co-occurrence matrix, a part-of-speech co-occurrence matrix, and a word–part-of-speech co-occurrence matrix, and
the learning unit takes the group of texts to be processed as input and generates the model by obtaining the document–topic matrix, the topic–word matrix, the topic–part-of-speech matrix, the word embedding matrix and context-word embedding matrix derived by factorizing the word co-occurrence matrix, the part-of-speech embedding matrix and contextual part-of-speech embedding matrix derived by factorizing the part-of-speech co-occurrence matrix, the part-of-speech embedding matrix and context-word embedding matrix derived by factorizing the word–part-of-speech co-occurrence matrix, the topic embedding matrix and context-word embedding matrix derived by factorizing the topic–word matrix, and the topic embedding matrix and contextual part-of-speech embedding matrix derived by factorizing the topic–part-of-speech matrix.
The search device according to claim 2, wherein
the learning unit generates the matrix representing the importance of the words in each text based on word frequencies and generates the matrix representing the importance of the parts of speech in each text based on part-of-speech frequencies, and
the learning unit generates the word co-occurrence matrix based on the pointwise mutual information between words, generates the part-of-speech co-occurrence matrix based on the pointwise mutual information between parts of speech, and generates the word–part-of-speech co-occurrence matrix based on the pointwise mutual information between words and parts of speech.
A search method in which a computer:
inputs a group of texts to be processed;
simultaneously learns a first model, in which a matrix representing the importance of the words in each text of the text group is factorized by non-negative matrix factorization with topics as the basis, and a second model based on the co-occurrence of the words in the text group, thereby generating a model that outputs a distributed vector when a document or word is input; and
inputs a word or document into the model and outputs, as a search result, a word or document having a distributed vector close to the obtained distributed vector.
The search method according to claim 4, wherein
the first model is a model in which a matrix representing the importance of the words in each text is factorized into a document–topic matrix and a topic–word matrix, and a matrix representing the importance of the parts of speech in each text is factorized into the document–topic matrix and a topic–part-of-speech matrix,
the second model is a model in which the co-occurrence of the words in the text group is expressed by a word co-occurrence matrix, a part-of-speech co-occurrence matrix, and a word–part-of-speech co-occurrence matrix, and
the computer takes the group of texts to be processed as input and generates the model by obtaining the document–topic matrix, the topic–word matrix, the topic–part-of-speech matrix, the word embedding matrix and context-word embedding matrix derived by factorizing the word co-occurrence matrix, the part-of-speech embedding matrix and contextual part-of-speech embedding matrix derived by factorizing the part-of-speech co-occurrence matrix, the part-of-speech embedding matrix and context-word embedding matrix derived by factorizing the word–part-of-speech co-occurrence matrix, the topic embedding matrix and context-word embedding matrix derived by factorizing the topic–word matrix, and the topic embedding matrix and contextual part-of-speech embedding matrix derived by factorizing the topic–part-of-speech matrix.
The search method according to claim 5, wherein the computer:
generates the matrix representing the importance of the words in each text based on word frequencies;
generates the matrix representing the importance of the parts of speech in each text based on part-of-speech frequencies;
generates the word co-occurrence matrix based on the pointwise mutual information between words;
generates the part-of-speech co-occurrence matrix based on the pointwise mutual information between parts of speech; and
generates the word–part-of-speech co-occurrence matrix based on the pointwise mutual information between words and parts of speech.
A program that causes a computer to operate as each unit of the search device according to any one of claims 1 to 3.
JP2020205794A 2020-12-11 2020-12-11 Search device, search method, and program Active JP7012811B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2020205794A JP7012811B1 (en) 2020-12-11 2020-12-11 Search device, search method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2020205794A JP7012811B1 (en) 2020-12-11 2020-12-11 Search device, search method, and program

Publications (2)

Publication Number Publication Date
JP7012811B1 true JP7012811B1 (en) 2022-01-28
JP2022092849A JP2022092849A (en) 2022-06-23

Family

ID=80735325

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2020205794A Active JP7012811B1 (en) 2020-12-11 2020-12-11 Search device, search method, and program

Country Status (1)

Country Link
JP (1) JP7012811B1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015152983A (en) 2014-02-12 2015-08-24 日本電信電話株式会社 Topic modeling device, topic modeling method, and topic modeling program

Also Published As

Publication number Publication date
JP2022092849A (en) 2022-06-23

Similar Documents

Publication Publication Date Title
CN109885672B (en) Question-answering type intelligent retrieval system and method for online education
US11250033B2 (en) Methods, systems, and computer program product for implementing real-time classification and recommendations
US11086601B2 (en) Methods, systems, and computer program product for automatic generation of software application code
US10705796B1 (en) Methods, systems, and computer program product for implementing real-time or near real-time classification of digital data
JP6309644B2 (en) Method, system, and storage medium for realizing smart question answer
US10467122B1 (en) Methods, systems, and computer program product for capturing and classification of real-time data and performing post-classification tasks
US7295965B2 (en) Method and apparatus for determining a measure of similarity between natural language sentences
CN114565104A (en) Language model pre-training method, result recommendation method and related device
CN112685550B (en) Intelligent question-answering method, intelligent question-answering device, intelligent question-answering server and computer readable storage medium
CN114896377A (en) Knowledge graph-based answer acquisition method
CN113821527A (en) Hash code generation method and device, computer equipment and storage medium
CN115374270A (en) Legal text abstract generation method based on graph neural network
US20220245361A1 (en) System and method for managing and optimizing lookup source templates in a natural language understanding (nlu) framework
CN117370580A (en) Knowledge-graph-based large language model enhanced dual-carbon field service method
JP7012811B1 (en) Search device, search method, and program
CN116450855A (en) Knowledge graph-based reply generation strategy method and system for question-answering robot
Bachrach et al. An attention mechanism for neural answer selection using a combined global and local view
US20220237383A1 (en) Concept system for a natural language understanding (nlu) framework
US20220229990A1 (en) System and method for lookup source segmentation scoring in a natural language understanding (nlu) framework
Karpagam et al. Deep learning approaches for answer selection in question answering system for conversation agents
Lokman et al. A conceptual IR chatbot framework with automated keywords-based vector representation generation
CN114328820A (en) Information searching method and related equipment
CN113157892A (en) User intention processing method and device, computer equipment and storage medium
Tamang et al. Adding smarter systems instead of human annotators: re-ranking for system combination
CN114722267A (en) Information pushing method and device and server

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20201211

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20220111

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20220118

R150 Certificate of patent or registration of utility model

Ref document number: 7012811

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250