JP6375210B2

JP6375210B2 - Model construction apparatus and program

Info

Publication number: JP6375210B2
Application number: JP2014229779A
Authority: JP
Inventors: 松本 一則; 服部 元; 滝嶋 康弘
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2014-11-12
Filing date: 2014-11-12
Publication date: 2018-08-15
Anticipated expiration: 2034-11-12
Also published as: JP2016095568A

Description

The present invention relates to a model construction apparatus and program capable of performing, at high speed, latent topic analysis that takes the hierarchy among topics into account.

In recent years, topic models have attracted attention as a technique for analyzing discrete data such as documents and purchase histories. By probabilistically modeling the generation process of documents represented as bag-of-words, a topic model enables high-accuracy clustering based on latent factors that cannot be observed directly.

A characteristic of topic models is that a single document is represented as a mixture of multiple topics. The representative method, Latent Dirichlet Allocation (hereinafter, LDA), has been applied in various data mining fields such as information retrieval, speech recognition, and QA systems. LDA is disclosed in Non-Patent Document 1, and an application of LDA to a QA system is disclosed in Patent Document 1.

In addition, as disclosed in Patent Document 2 and Non-Patent Document 2, there are also hierarchical topic models, in which the topics have a hierarchical structure, intended to make the classification results easier to use.

Patent Document 1: JP 2013-143066 A
Patent Document 2: JP 2013-134750 A

Non-Patent Document 1: D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, 3:993-1022, 2003.
Non-Patent Document 2: W. Li and A. McCallum, "Pachinko allocation: DAG-structured mixture models of topic correlations," Proc. ICML, 2006.

Non-Patent Document 2 (whose method is hereinafter abbreviated as "PAM") presupposes a hierarchical structure of topics. Using Gibbs sampling, it determines supertopic nodes from the upper level down to the lower level, like a pachinko ball falling, and thereby obtains the relationship between each word appearing in a document and the topics. The PAM method has the problem that many computation steps are required before the relationship between each document and the topics is obtained.

FIG. 1 is a diagram showing an example of such a hierarchical structure of topics. The upper-level topics are "medicine" and "economy", and the lower-level topics are "regenerative medicine", "regional medicine", "new drug development", "financial crisis", and "trade liberalization". As illustrated, an upper-level node and a lower-level node that have a hierarchical relationship are connected by an arrow. "New drug development" is a lower-level topic shared by both "medicine" and "economy". The example of FIG. 1 has a two-layer structure of upper and lower levels, but in general an n-layer structure (n ≧ 2) is also possible.

FIG. 2 is a diagram showing the calculation process by which PAM obtains the relationship between each document and the topics for this hierarchical structure. PAM computes the probability of each path zi from the root node to the word in question and updates θ(D) and φ(W) based on the probability P(zi|w), derived from those path probabilities, of reaching the word w from each topic node. Here, θ(D) and φ(W) are, respectively, the per-document topic ratio and the per-topic word distribution in latent topic analysis.

As the example of FIG. 2 also shows, a path zi traverses the root node, the plurality of supertopic nodes s, the plurality of topic nodes κ, and the plurality of word nodes ν, so paths exist in a large number of combinations, and PAM therefore requires many computation steps. Note that the supertopic nodes s correspond to the upper-level topics and the topic nodes κ correspond to the lower-level topics.

Patent Document 2 uses a hierarchical topic model to find words that are useful for interactive search. However, it assumes that the hierarchical topic model to be used has already been constructed, and it does not describe how to obtain the relationship between the topics and the words that actually appear in each document. Consequently, as with conventional methods such as that of Non-Patent Document 2, many computation steps are required in the process of classifying documents by hierarchical topics.

In view of the above problems of the conventional technology, an object of the present invention is to provide a model construction apparatus and program capable of constructing a hierarchical topic model at high speed.

To achieve the above object, the present invention is a model construction apparatus that performs latent topic analysis on a series of target data represented as bag-of-words and outputs the analysis result as a model, the apparatus comprising: an initial setting unit that sets initial values for a topic weight matrix of each target data item and a topic weight matrix of each word; and an update calculation unit that obtains the matrices to be output as the model by sequentially performing Gibbs sampling on the matrices set to the initial values. In the output model, lattice-structured connections between topics are given as a hierarchical structure. When updating each element of the topic weight matrix of each target data item from an old topic to a new topic during the sequential Gibbs sampling, the update calculation unit restricts the candidates for the new topic to those whose distance from the old topic in the lattice structure is equal to or less than a predetermined value.

The present invention is also a program that causes a computer to function as the model construction apparatus.

According to the present invention, lattice-structured connections between topics are given as a hierarchical structure, and the topic update candidates in the Gibbs sampling process are restricted to topics at a small distance on that hierarchical structure, so a hierarchical topic model can be constructed at high speed.

FIG. 1 is a diagram showing an example of a hierarchical structure of topics.
FIG. 2 is a diagram showing the calculation process for obtaining the relationship between each document and the topics for a hierarchical structure.
FIG. 3 is a functional block diagram of a model construction apparatus according to one embodiment.
FIG. 4 is a diagram showing an example of the initialization of the matrix θ(D) in ordinary LDA or the like.
FIG. 5 is a diagram showing an example of the initialization of the matrix φ(W) in ordinary LDA or the like.
FIG. 6 is a flowchart of the Gibbs sampling process in ordinary LDA or the like.
FIG. 7 is a diagram showing an example for explaining the update process in FIG. 6.
FIG. 8 is a diagram showing an example of the lattice structure used as the hierarchical structure.

FIG. 3 is a functional block diagram of a model construction apparatus according to one embodiment. The model construction apparatus 10 includes an initial setting unit 1 and an update calculation unit 2. The model construction apparatus 10 reads a group of documents as input and outputs a model that is the result of applying latent topic analysis such as LDA.

In the present invention, the calculation framework for constructing the output model is shared with the framework of ordinary LDA or the like, which is disclosed in Non-Patent Document 1 and does not consider a hierarchical structure among topics. In particular, by performing the calculation within this shared framework using the technique detailed below, a model that takes the hierarchical structure into account can be output under roughly the same computational load as ordinary LDA or the like, which does not consider a hierarchical structure. That is, a model that considers the hierarchical structure, as PAM does, can be output without the large computational load of repeating calculations over the many existing paths, as in the PAM of Non-Patent Document 2. To this end, the units 1 and 2 of the model construction apparatus 10 specifically perform the following processing.

First, the initial setting unit 1 reads a group of documents, each represented as a bag-of-words, and performs the same initialization as when computing an ordinary topic model such as LDA. That is, by randomly assigning a topic to each word occurrence in each document, it sets, as initial values, a matrix θ(D), D={d1, d2, …}, representing the topic weights in each document d1, d2, …, and a matrix φ(W), W={w1, w2, …}, representing the topic weights for each word w1, w2, …, and passes these initial matrices θ(D) and φ(W) to the update calculation unit 2.

FIG. 4 is a diagram showing an example of the initialization of the matrix θ(D) in ordinary LDA or the like. Part (1) shows, as an example of the LDA calculation settings, that the hyperparameters α and β are set to 0.1 and 0.01, respectively, and, as an example of the input document group, the word frequency (bag-of-words) information for two documents, document 1 and document 2.

Then, as shown separately in (21), (31) and in (22), (32), the same processing is applied to document 1 and document 2. First, as shown in (21) and (22), topics are set at random, one for each word occurrence in each document. For example, as shown in (1), word A occurs 5 times, word B 3 times, and word C 2 times in document 1, so, as shown in (21) and below, that many topics are randomly assigned from among the four topics [0], [1], [2], [3].
Five topics assigned to word A: [3], [0], [3], [1], [2]
Three topics assigned to word B: [1], [0], [3]
Two topics assigned to word C: [3], [1]

Note that topics are written, where appropriate, in an "array" form such as [0], [1], [2], [3]. (This notation is used to make the later description of the update process clear.)

Next, as shown in (31) and (32), the topic ratio is obtained by counting the randomly assigned topics for each topic. For example, for document 1, the topics are counted and the topic ratio is obtained as shown in (31) and below.
Topic ratio of document 1 (topic0, topic1, topic2, topic3) = (2, 3, 1, 4)

Finally, by listing the per-document topic ratios for all documents in matrix form, the initialized topic ratio θ(D) of all documents is obtained, as shown in (4).
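The initialization of θ(D) described above can be sketched in Python as follows. This is only an illustrative sketch: the function names `init_assignments` and `topic_counts_per_document` and the dictionary layout are assumptions made for this example and do not appear in the publication. The demonstration fixes the assignments of document 1 to the values given in the text instead of drawing them at random, so that the tally can be compared with the topic ratio listed above.

```python
import random
from collections import Counter

def init_assignments(doc_word_counts, num_topics, rng):
    """Assign one random topic to every word occurrence in every document."""
    return [
        {word: [rng.randrange(num_topics) for _ in range(count)]
         for word, count in doc.items()}
        for doc in doc_word_counts
    ]

def topic_counts_per_document(assignments, num_topics):
    """Tally the assigned topics per document: one row of theta(D) per document."""
    theta = []
    for doc in assignments:
        tally = Counter(t for topics in doc.values() for t in topics)
        theta.append([tally[k] for k in range(num_topics)])
    return theta

# Document 1 with the topic assignments listed in the text (hypothetical layout).
doc1 = {"A": [3, 0, 3, 1, 2], "B": [1, 0, 3], "C": [3, 1]}
theta = topic_counts_per_document([doc1], 4)
print(theta)  # [[2, 3, 1, 4]]
```

In an actual run the assignments would come from `init_assignments`, for example `init_assignments([{"A": 5, "B": 3, "C": 2}], 4, random.Random(0))`.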

FIG. 5 is a diagram showing an example of the initialization of the matrix φ(W) in ordinary LDA or the like, and corresponds to the example of FIG. 4. First, as shown in (21) and (22), the information prepared when θ(D) was initialized, in which topics were set at random for each word occurrence in each document, is the input to this processing. That is, (21) and (22) are common to FIG. 4 and FIG. 5.

Next, as shown in (51) and (52), the information of (21) and (22) is tallied by word and by topic for document 1 and document 2, respectively, to obtain the per-topic word distribution φ(W) for each document.

For example, in (51), word A occurs 5 times in document 1, and topics are randomly assigned to it as follows.
Five topics assigned to word A: [3], [0], [3], [1], [2]

Accordingly, as shown in the first row of the tabular data in (51), the topic assignment counts for word A in document 1 are as follows.
Topic assignment counts for word A in document 1 (topic0, topic1, topic2, topic3) = (1, 1, 1, 2)

The assignment counts are obtained in the same way for the remaining words B, C, and D of document 1, and tallying the results for all words of document 1 gives the word distribution shown in (51). For example, as shown in the first column of the tabular data in (51), the word distribution of topic 0 is as follows.
Word distribution of topic 0 (word A, word B, word C, word D) = (1, 1, 0, 0)

Finally, by summing the per-document word distributions obtained in (51) and (52) over all documents, the initialized per-topic word distribution φ(W) over all documents is obtained, as shown in (6).
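The tally that yields φ(W) can be sketched as below. This is a minimal sketch under assumptions: the function name `word_topic_counts`, the dictionary layout, and the vocabulary list are hypothetical, and only document 1 is included because the text does not list the assignments for document 2.

```python
def word_topic_counts(assignments, vocabulary, num_topics):
    """Count, for every word, how often each topic is assigned to it (phi(W))."""
    phi = {word: [0] * num_topics for word in vocabulary}
    for doc in assignments:
        for word, topics in doc.items():
            for t in topics:
                phi[word][t] += 1
    return phi

# Document 1 with the topic assignments listed in the text; word D does not occur in it.
doc1 = {"A": [3, 0, 3, 1, 2], "B": [1, 0, 3], "C": [3, 1]}
vocabulary = ["A", "B", "C", "D"]
phi = word_topic_counts([doc1], vocabulary, 4)

print(phi["A"])                         # [1, 1, 1, 2]
print([phi[w][0] for w in vocabulary])  # [1, 1, 0, 0]
```

Passing the assignments of all documents in one list gives the summed counts over the whole document group.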

The update calculation unit 2 performs sequential update processing on the initialized θ(D) and φ(W) obtained by the initial setting unit 1 as described above, and thereby obtains the final θ(D) and φ(W) that are output from the model construction apparatus 10. The framework of this update processing can be shared with the Gibbs sampling process used in ordinary LDA or the like, and the technique specific to the present invention is applied to the content of each individual update.

Accordingly, the Gibbs sampling process in ordinary LDA or the like is described first below, followed by the update processing specific to the present invention.

FIG. 6 is a flowchart of the Gibbs sampling process in ordinary LDA or the like, and FIG. 7 is a diagram showing an example for explaining the update process in that flowchart. The example of FIG. 7 corresponds to the examples shown in FIG. 4 and FIG. 5.

When the flow of FIG. 6 starts, step S1 first sets to their initial values (generally 0 or 1) the counters that control the target of the update processing in this flow, namely the counter i of the document di, the counter j of word occurrences within the document di, and the counter k of loop iterations, and the flow proceeds to step S2.

In step S2, Gibbs sampling is performed, and the flow then proceeds to step S3. That is, the topic corresponding to the j-th word in the document di specified by the current counters (i, j) is newly determined based on the current matrices φ(W) and θ(D), and the matrices φ(W) and θ(D) are updated according to the newly determined topic. Details of step S2 are described later with reference to FIG. 7.

In step S3, the value of the counter j is used to determine whether processing has been completed for all word occurrences in the document di. If it has, the counter j is reset to its initial value and the flow proceeds to step S4; if not, the flow proceeds to step S31. In step S31, the counter j is incremented by 1 and the flow returns to step S2. The determination in step S3 thus forms the loop L3 over each document di, as illustrated.

In step S4, the value of the counter i is used to determine whether processing has been completed for all documents di. If it has, the counter i is reset to its initial value and the flow proceeds to step S5; if not, the flow proceeds to step S41. In step S41, the counter i is incremented by 1 and the flow returns to step S2. The determination in step S4 thus forms the loop L4 over all documents di (i=1, 2, …), as illustrated.

In step S5, it is determined whether the completion condition for the entire loop processing of FIG. 6 is satisfied. If it is, the flow ends; if not, the flow proceeds to step S51, where the counter k is incremented by 1, and then returns to step S2. The determination in step S5 thus forms the loop L5 over the flow as a whole, as illustrated.

As the completion condition for the entire loop processing, it is possible to use, for example, the counter k reaching a predetermined value, or the difference between the matrices φ(W), θ(D) obtained at the preceding (k-1)-th iteration and the matrices φ(W), θ(D) obtained at the current k-th iteration being equal to or less than a predetermined value.

With the flow of FIG. 6 described above, the matrices φ(W) and θ(D) held at the time the completion condition is judged to be satisfied in step S5 are output as the final result, that is, as the model.
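The nesting of the loops L3, L4, and L5 in the flow of FIG. 6 can be sketched as follows. This shows only the control structure, under assumptions: the name `run_sweeps`, the representation of a document as a list of (word, topic) pairs, and the `resample_topic` and `converged` callbacks are hypothetical stand-ins for step S2 and the completion check of step S5. The demonstration uses a callback that keeps every topic unchanged, purely to count how often step S2 would run.

```python
def run_sweeps(documents, resample_topic, max_sweeps, converged=None):
    """Run the nested loops of the flow; returns the number of sweeps performed."""
    sweeps_done = 0
    for k in range(1, max_sweeps + 1):                     # loop L5: whole flow
        for i, doc in enumerate(documents):                # loop L4: all documents
            for j, (word, old_topic) in enumerate(doc):    # loop L3: word occurrences
                # Step S2: decide a new topic for the j-th word of document i.
                doc[j] = (word, resample_topic(i, word, old_topic))
        sweeps_done = k
        # Step S5: stop early if the caller's completion condition holds.
        if converged is not None and converged(k):
            break
    return sweeps_done

# Demonstration with a placeholder callback that leaves each topic as it is.
documents = [[("A", 3), ("A", 0), ("B", 1)], [("C", 2)]]
calls = []

def keep_topic(i, word, old_topic):
    calls.append((i, word))
    return old_topic

sweeps = run_sweeps(documents, keep_topic, max_sweeps=3)
print(sweeps, len(calls))  # 3 12
```

A difference-based completion condition, as mentioned in the text, would be supplied through `converged` by comparing the matrices before and after each sweep.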

The processing of step S2 in FIG. 6, that is, the details of Gibbs sampling, is now described with reference to FIG. 7. FIG. 7 takes as its example the step S2 executed immediately after the flow of FIG. 6 starts (that is, the first step S2) on the initialized matrices φ(W) and θ(D) shown as examples in FIG. 4 and FIG. 5. As indicated on the right side of FIG. 7, where the loops L3, L4, and L5 common to FIG. 6 are shown, the processing is the same in the general case after the first time.

Since this is the first pass, the update target is [3], which is highlighted in bold and underline in (21) of FIG. 7 (identical to (21) in FIG. 4 and FIG. 5). It is the first topic (counter j = initial value 1) in the array of topics assigned to document 1 (counter i = initial value 1), one per word occurrence. Then, as shown in (7), a new ID is determined for the update target [3], that is, it is replaced with a new topic.

In the determination process of (7), the current matrices φ(W) and θ(D) are referred to, as shown in (510) and (310). Here, (510) represents the referenced portion of the matrix φ(W) shown in (51) of FIG. 5, and (310) represents the referenced portion of (31) in FIG. 4.

When the new ID is actually determined in (7), the matrix φ(W) is referred to, and a random variable is used that outputs a topic with a probability following the topic ratio of word A, the word to which the topic [3] (j=1) being updated is assigned. In the example of (7), the conditional probabilities are as follows.
P(topic0 | word A) = 1/(1+2+1+3)
P(topic1 | word A) = 2/(1+2+1+3)
P(topic2 | word A) = 1/(1+2+1+3)
P(topic3 | word A) = 3/(1+2+1+3)

In other words, the new topic ID is determined by rolling what might be called a "biased die" defined by the matrix φ(W) as above. In the following, it is assumed that this die produced [2] as the new ID. (In general, the new and old IDs may turn out to be the same.)

Part (8) shows the topic array updated from the state of (21) by replacing the old ID [3] with the new ID [2] in accordance with the determination in (7). With this updated topic array, the corresponding portions of the matrices φ(W) and θ(D) are also updated from the states of (510) and (310), as shown in (511) and (311), respectively. These portions of φ(W) and θ(D) can be updated by the same tallying as in the initial value setting described with reference to FIG. 4 and FIG. 5. In the example of (511) and (311), as a result of topic [3] being replaced with [2], the count of [3] has decreased by 1 and the count of [2] has increased by 1 relative to (510) and (310) before the update.
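A single update of step S2 can be sketched as follows. This sketch follows the simplified description in the text, in which the new topic is drawn in proportion to the counts for the word in φ(W); it is not the full collapsed Gibbs conditional of LDA with the hyperparameters α and β. The function names are hypothetical, the row for word A uses the counts (1, 2, 1, 3) that appear in the conditional probabilities above, and the row for document 1 uses its topic ratio (2, 3, 1, 4). The new topic is fixed to 2 to mirror the walk-through rather than drawn at random.

```python
import random

def topic_probabilities(word_row):
    """Normalize the topic counts of one word into selection probabilities."""
    total = sum(word_row)
    return [count / total for count in word_row]

def draw_topic(word_row, rng):
    """Roll the 'biased die': pick a topic in proportion to the word's counts."""
    return rng.choices(range(len(word_row)), weights=word_row, k=1)[0]

def move_count(row, old_topic, new_topic):
    """Reflect the replacement in a count row: old topic -1, new topic +1."""
    row[old_topic] -= 1
    row[new_topic] += 1

word_a = [1, 2, 1, 3]   # counts for word A used in the probabilities above
doc_1 = [2, 3, 1, 4]    # topic ratio of document 1
probs = topic_probabilities(word_a)

old_topic, new_topic = 3, 2   # fixed to match the text; normally draw_topic(word_a, rng)
move_count(word_a, old_topic, new_topic)
move_count(doc_1, old_topic, new_topic)
print(word_a, doc_1)  # [1, 2, 2, 2] [2, 3, 2, 3]
```

In a real run, `new_topic = draw_topic(word_a, random.Random())` would replace the fixed value, and the result may equal the old topic.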

Gibbs sampling in ordinary LDA or the like has been described above with reference to FIG. 6 and FIG. 7. The update calculation unit 2 performs Gibbs sampling according to a sequentially repeating flow structure like that of FIG. 6, but by adding constraints to each update in step S2, it reduces the computational load and makes the finally constructed model one that takes the hierarchical structure into account. Specifically, the following (Constraint 1) and (Constraint 2) are added.

(Constraint 1) The number of topics is limited to a power of 2, and each topic is given a topic label that is the binary representation of its ID. As the hierarchical structure among topics, a lattice structure is adopted in the form of a graph in which nodes whose binary topic labels are at Hamming distance 1 from each other are connected by an edge, so that nodes at Hamming distance n can be reached from each other via n edges (the hop count is n). Here, the existence of an edge between nodes means that there is a hierarchical connection between the topics corresponding to those nodes.

FIG. 8 is a diagram showing an example of the lattice structure used as the hierarchical structure. FIG. 8 shows, as an example, the lattice structure for the case where the number of topics is 2 to the 4th power (=16), with each node represented by the topic label given to the corresponding topic ID.
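The lattice of (Constraint 1) can be sketched in a few lines. The helper names `label`, `hamming`, and `lattice_edges` are hypothetical; the sketch simply builds the graph in which two topic IDs are joined when their binary labels differ in exactly one bit, for the 16-topic case of FIG. 8.

```python
def label(topic_id, bits):
    """Binary topic label of an ID, zero-padded to the given number of bits."""
    return format(topic_id, f"0{bits}b")

def hamming(a, b):
    """Hamming distance between the binary representations of two IDs."""
    return bin(a ^ b).count("1")

def lattice_edges(bits):
    """All pairs of topic IDs joined by an edge (Hamming distance exactly 1)."""
    size = 2 ** bits
    return [(a, b) for a in range(size) for b in range(a + 1, size)
            if hamming(a, b) == 1]

edges = lattice_edges(4)
print(len(edges))                # 32
print(label(4, 4))               # 0100
print(hamming(0b0100, 0b0011))   # 3
```

With 4 bits, each of the 16 nodes has 4 neighbors, which gives 16 × 4 / 2 = 32 edges; a pair at Hamming distance 3 is reachable in 3 hops.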

Although it was omitted above for the sake of the flow of the explanation, the initial setting unit 1 also obeys the restriction on the number of topics in (Constraint 1). Subject to that restriction, the initial values of the matrices φ(W) and θ(D) are set in the same way as in the ordinary method, as described above.

(Constraint 2) In the process of replacing a topic ID from the old ID to a new ID, described with reference to FIG. 7 and elsewhere, the candidates for the new ID (the transition destinations) are not all topics, as in the ordinary method, but are restricted to those that can be reached from the old ID in the lattice structure within a predetermined number n of hops (transitions between nodes), that is, those whose Hamming distance from the old ID is equal to or less than the predetermined number n.

This restriction simplifies the calculation of the probability P(z|w) (where z is a topic ID and w is a word) and lowers the computational load. The probabilities may be determined by normalizing within the restricted range.

With the lattice structure of FIG. 8, for example, if the topic label of the old ID being updated is "0100" and the candidates for the new ID (transition destination) are those at Hamming distance 1 or less, the transition destination topic labels are limited to five: "0100", "0000", "1100", "0110", and "0101". Their probabilities may be obtained by normalizing the following probabilities, which are found by referring to the matrix φ(W). Here, w is the word to which the topic with the old ID being updated is assigned.
P(0100|w), P(0000|w), P(1100|w), P(0110|w), P(0101|w)
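The candidate restriction of (Constraint 2) and the normalization within the restricted range can be sketched as follows. The function names are hypothetical, and the 16 counts in `word_row` are made-up values for some word w, chosen only to illustrate the arithmetic; the text does not give such a row.

```python
def candidates(old_id, bits, max_distance):
    """Topic IDs within the given Hamming distance of the old ID (old ID included)."""
    return [z for z in range(2 ** bits)
            if bin(z ^ old_id).count("1") <= max_distance]

def restricted_probabilities(word_row, old_id, bits, max_distance):
    """Normalize the word's topic counts over the restricted candidates only."""
    cand = candidates(old_id, bits, max_distance)
    total = sum(word_row[z] for z in cand)
    return {z: word_row[z] / total for z in cand}

# Hypothetical topic counts of one word over the 16 topics of the 4-bit lattice.
word_row = [2, 0, 0, 0, 3, 1, 0, 0, 0, 0, 0, 0, 4, 0, 0, 1]
old_id = 0b0100

cand_labels = [format(z, "04b") for z in candidates(old_id, 4, 1)]
probs = restricted_probabilities(word_row, old_id, 4, 1)
print(cand_labels)  # ['0000', '0100', '0101', '0110', '1100']
print(probs)        # {0: 0.2, 4: 0.3, 5: 0.1, 6: 0.0, 12: 0.4}
```

Only 5 of the 16 topics are evaluated per update in this setting. The total cannot be zero when the counts come from the current assignments, because the old topic itself is among the candidates and is assigned to the word at least once.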

Model construction by the model construction apparatus 10 as described above is the first embodiment. Another embodiment of the model construction (the second embodiment) is described next.

In the first embodiment, topic IDs are allocated according to the lattice structure, the initial setting unit 1 then sets the initial values of the matrices φ(W) and θ(D), and the update calculation unit 2 updates those initial values by Gibbs sampling to obtain the final matrices φ(W) and θ(D). In the second embodiment, by contrast, the allocation of topic IDs according to the lattice structure is carried out in successive stages, and the initial setting unit 1 and the update calculation unit 2 perform model construction for each allocation.

Specifically, writing G(i) for the set of topic IDs allocated at the i-th stage (i=1, 2, …, m), the sets G(i) are set in advance so that the following relationship holds (a relationship in which the set grows as topics are successively added). Here, G(m), allocated at the final m-th stage, is the entire lattice structure used in the first embodiment, consisting of a power-of-2 number of topics.
G(1) ⊂ G(2) ⊂ G(3) ⊂ … ⊂ G(m-1) ⊂ G(m)

Then, for G(1), G(2), …, G(m) in that order, the initial setting unit 1 and the update calculation unit 2 perform model construction in the same manner as in the first embodiment, and the result obtained for the final G(m) is adopted as the finally constructed model. The Gibbs sampling calculation during model construction is the same as in the first embodiment; the only difference is that the set of allocated topic IDs is restricted to G(i).

In the second embodiment in particular, let φ(W)[i] and θ(D)[i] denote the matrices obtained by model construction for the topic ID assignment G(i). These i-th results φ(W)[i] and θ(D)[i] are used by the initial setting unit 1 as the initial values for the next topic ID assignment G(i+1). Initial values set in this way can be closer to the final result than randomly set ones, so fewer iterations are needed to converge to the final result and the processing is faster.
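The staged flow with reused initial values might be sketched as follows. This is only an outline under stated assumptions: a weight matrix is represented as a dictionary from topic ID to a list of weights, `fit` is a hypothetical placeholder for the Gibbs sampling performed by the update calculation unit 2, and topics that are new at a stage are initialized randomly, a choice the text above does not specify.

```python
import random

def init_from_previous(prev, topics, n_rows, rng):
    """Build initial weights over `topics`.

    prev: dict topic ID -> list of weights from the previous stage, or None.
    Topics already fitted keep their weights; new topics start at random
    (an assumption of this sketch).
    """
    init = {}
    for t in sorted(topics):
        if prev is not None and t in prev:
            init[t] = list(prev[t])  # reuse the stage-i result
        else:
            init[t] = [rng.random() for _ in range(n_rows)]
    return init

def build_sequentially(stages, n_rows, fit, seed=0):
    """Run model construction for G(1), G(2), ..., G(m) in order."""
    rng = random.Random(seed)
    result = None  # the first stage has no previous result: fully random start
    for topics in stages:
        init = init_from_previous(result, topics, n_rows, rng)
        result = fit(init, topics)  # hypothetical Gibbs sampling step
    return result  # the result for G(m) is the final model
```

Passing `fit=lambda init, topics: init` runs the loop without any sampling, which is enough to check that weights carry over from one stage to the next.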

Note that for the first model construction, with G(1), the initial values are set randomly. Also for this first model construction, the processing of the update calculation unit 2 may be the same as in ordinary LDA or the like; that is, (Constraint 2) described above may be omitted.

The sequentially growing topic sets G(1)⊂G(2)⊂G(3)⊂…⊂G(m-1)⊂G(m) can be set, for example, as follows.

First, G(1) is formed by selecting IDs as follows, where n is the exponent that gives the total number of topics as 2^n and C(n, k) denotes the binomial coefficient (also written nCk). When n is even, C(n, n/2) = n·(n-1)·(n-2)···(n/2+1) / (1·2·3···(n/2)) IDs are selected; when n is odd, C(n, (n-1)/2) = n·(n-1)·(n-2)···((n+3)/2) / (1·2·3···((n-1)/2)) IDs are selected. In the example of FIG. 8 (n = 4), these IDs are drawn in the middle level of the figure, and G(1) can be formed from the six IDs listed below.
G(1) = {1100, 1010, 1001, 0110, 0101, 0011}
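The selection of G(1) can be reproduced with a few lines of Python. The sketch below lists every n-bit label with floor(n/2) ones, which is n/2 for even n and (n-1)/2 for odd n; the function name `middle_level` is an assumption made for this example.

```python
from itertools import combinations
from math import comb

def middle_level(n: int) -> list[str]:
    """Labels of n bits with floor(n/2) ones: the central level of the lattice."""
    k = n // 2  # n/2 for even n, (n-1)/2 for odd n
    labels = []
    for ones in combinations(range(n), k):
        labels.append("".join("1" if i in ones else "0" for i in range(n)))
    return labels

g1 = middle_level(4)
print(g1)       # ['1100', '1010', '1001', '0110', '0101', '0011']
print(len(g1))  # 6, which equals comb(4, 2)
```

For n = 4 this yields the same six IDs as the FIG. 8 example.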

Note that, in general, C(n, i) (the binomial coefficient, also written nCi) equals the number of nodes in the i-th level (i = 0, 1, 2, …, n) when the nodes of a lattice structure of 2^n nodes are drawn as a graph divided into n+1 levels as in FIG. 8. Within the i-th level (excluding levels 0 and n, which each contain a single node), no two nodes are joined by an edge; they are at least two hops apart. The selection of G(1) above corresponds to choosing the nodes in the level as close to the center of the n+1 levels as possible. Accordingly, when n is odd, the C(n, (n+1)/2) IDs of level (n+1)/2 may be selected instead of the C(n, (n-1)/2) IDs above.
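Both properties of the levels can be checked numerically. The sketch below groups the 2^n topic IDs by the number of 1 bits in their binary label and verifies that each level has a binomial-coefficient number of nodes and that no two nodes of one level differ in exactly one bit; the helper name `levels` is an assumption made for this example.

```python
from itertools import combinations
from math import comb

def levels(n):
    """Group all 2**n topic IDs by the number of 1 bits in their label."""
    by_level = [[] for _ in range(n + 1)]
    for topic in range(2 ** n):
        by_level[bin(topic).count("1")].append(topic)
    return by_level

n = 4
for i, nodes in enumerate(levels(n)):
    # Level i holds comb(n, i) nodes.
    assert len(nodes) == comb(n, i)
    # Two distinct nodes in one level never differ in exactly one bit,
    # so there is no edge inside a level.
    assert all(bin(a ^ b).count("1") >= 2 for a, b in combinations(nodes, 2))

print([len(nodes) for nodes in levels(n)])  # [1, 4, 6, 4, 1]
```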

Each subsequent G(i) (i ≥ 2) is formed by adding to G(i-1) the label IDs that can be reached at Hamming distance 1 from a label ID contained in G(i-1). In the example of FIG. 8, G(2) and G(3) are formed as follows, so that G(3) contains the entire lattice structure of 2^4 = 16 label IDs.
G(2) = {1110, 1101, 1011, 0111} ∪ G(1) ∪ {1000, 0100, 0010, 0001}
G(3) = {1111} ∪ G(2) ∪ {0000}

Alternatively, each subsequent G(i) (i ≥ 2) may be formed by adding the label IDs that can be reached from a label ID contained in G(i-1) at a Hamming distance of at most a predetermined number n (that is, within n hops between nodes). Here n denotes this predetermined number, not the exponent in 2^n. The construction described above is the case n = 1.
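The growth of the sets from G(1), including the variant with a larger distance limit, can be sketched as below. The function `expand` and its parameter `max_hops` are names assumed for this example, and the hop count is taken to be the Hamming distance between labels, which holds when the full lattice of 2^n labels is used.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length labels differ."""
    return sum(x != y for x, y in zip(a, b))

def expand(current: set[str], universe: set[str], max_hops: int = 1) -> set[str]:
    """Add every label in `universe` within `max_hops` of some label in `current`."""
    added = {u for u in universe
             if any(hamming(u, c) <= max_hops for c in current)}
    return current | added

n = 4
universe = {format(t, f"0{n}b") for t in range(2 ** n)}
g = {"1100", "1010", "1001", "0110", "0101", "0011"}  # G(1) of the FIG. 8 example
stages = [g]
while stages[-1] != universe:
    stages.append(expand(stages[-1], universe))

print([len(s) for s in stages])  # [6, 14, 16]
```

With a distance limit of 1 the stages have 6, 14, and 16 IDs, matching G(1), G(2), and G(3) of the example; with `max_hops=2` a single expansion of G(1) already covers all 16 IDs.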

As described above, according to the present invention, hierarchical topic classification can be implemented with only slight changes to an implementation of a conventional, non-hierarchical topic model construction method based on Gibbs sampling. With this implementation, the update processing of θ(D) and φ(W) can be made faster than in conventional hierarchical topic models. In particular, the method of the second embodiment, which increases the number of topic classes in stages, enables a further speedup.

Supplementary matters of the present invention are described below.

(1) In both the first and second embodiments, the lattice structure of the model that the model construction apparatus 10 outputs as its final result uses all 2^n label IDs, as described above. Alternatively, only a predetermined subset of the 2^n label IDs may be used. Even then, the Gibbs sampling process can be carried out in the same way as described above. The sets G(1), G(2), …, G(m) of the second embodiment can likewise be set in the same way, by choosing the last set G(m) to be the ID set consisting of only that predetermined subset.

(2) As the input data on which the model construction apparatus 10 builds a model, a bag-of-words (BoW) representation of documents composed of ordinary text can be used, and a bag-of-words representation of any target other than text can be used in exactly the same way. For example, as is well known, the features of an image can be represented as a Bag of Visual Words; since this is a kind of bag-of-words representation, images represented as a Bag of Visual Words may be used as the input data.
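A minimal sketch of why text and images can be handled alike: in both cases the input reduces to counts of discrete tokens, with order discarded. The sample sentence and the visual word IDs below are made up for illustration; how an image is quantized into visual words is outside the scope of this sketch.

```python
from collections import Counter

def bag_of_words(tokens):
    """Count occurrences of each token, discarding order."""
    return Counter(tokens)

# A text document: tokens are words.
doc = "the cat sat on the mat".split()
# An image: tokens are visual word IDs (hypothetical quantized local features).
visual_words = [17, 3, 17, 42, 3, 17]

print(bag_of_words(doc)["the"])        # 2
print(bag_of_words(visual_words)[17])  # 3
```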

(3) The present invention can also be provided as a program that causes a computer to function as the model construction apparatus 10. The computer can be built from well-known hardware such as a CPU (central processing unit), memory, and various interfaces, and the CPU that loads and executes the program functions as each unit of the model construction apparatus 10.

10 … model construction apparatus, 1 … initial setting unit, 2 … update calculation unit

Claims

A model construction apparatus that performs latent topic analysis on a series of target data expressed as bags of words and outputs the analysis result as a model, comprising:
an initial setting unit that sets initial values for a topic weight matrix of each target data and a topic weight matrix of each word; and
an update calculation unit that obtains each matrix as the output model by sequentially performing Gibbs sampling on each matrix for which the initial values have been set,
wherein, in the output model, a lattice structure is given as a hierarchical structure between topics, and
wherein, in sequentially performing the Gibbs sampling, when updating an old topic to a new topic for each element of the topic weight matrix of each target data, the update calculation unit limits the candidates for the new topic to those whose distance from the old topic in the lattice structure is equal to or less than a predetermined value.

The model construction apparatus according to claim 1, wherein the lattice structure is given as connections between topics by treating each topic as a node and providing an edge between nodes whose labels, obtained by expressing the topic IDs in binary, are at a Hamming distance of 1.

The model construction apparatus according to claim 2, wherein, when updating from the old topic to the new topic, the update calculation unit limits the candidates for the new topic to those that can be reached from the old topic in the lattice structure within a predetermined number of hops, and obtains the probability of selecting each candidate by normalization from the topic weight matrix of each target data at the time of the update.

The model construction apparatus according to any one of claims 1 to 3, wherein the initial setting unit and the update calculation unit sequentially perform the latent topic analysis under the topics specified by each set G(i), for a series of topic sets G(i) (i = 1, 2, …, m) satisfying the relationship of expression (1) below, and output as the model the result of the latent topic analysis under the topics specified by the set G(m).
G(1)⊂G(2)⊂…⊂G(m) … Expression (1)

The model construction apparatus according to claim 4, wherein the initial setting unit uses the result of the latent topic analysis performed under the topics specified by the set G(i) as the initial values when the latent topic analysis is performed under the topics specified by the set G(i+1).

The model construction apparatus according to claim 4 or 5, wherein the topics constituting the set G(i+1) are obtained by adding, to the topics constituting the set G(i), topics that can be reached from them in the lattice structure within a predetermined number of hops.

The model construction apparatus according to claim 1, wherein the series of target data expressed as bags of words consists of documents or of data other than documents.

A program that causes a computer to function as the model construction apparatus according to any one of claims 1 to 7.